Foreword


Over  the  past  three  years  I  have  been  working  on  probabilistic  models  of  multivariate  attributes 
and  relations.  My  work  suggests  a  probabilistic  framework  and  a  general  modeling  approach  to 
complex  and  evolving  networks,  which  is  based  on  the  four  concepts  of  mixed  membership,  motifs, 
dynamics  and  integration.  In  this  thesis,  I  present  such  a  framework  and  discuss  its  properties.  In 
particular,  the  main  goal  of  the  research  is  to  establish  the  essential  elements  of  formal  models  of 
complexity  that  reconcile  the  global  properties  of  a  system  with  local  phenomena  of  interest,  in  a 
generative  fashion. 

A  solution  to  the  global/local  trade-off  is  to  express  complexity  through  hierarchical  mixtures 
of  simple  patterns,  i.e.,  motifs,  that  evolve  over  time.  Complex  global  behavior  emerges  from  the 
combination  of  local  interaction  patterns  and  their  dynamics.  I  discuss  the  extent  to  which  this 
novel  framework  incorporates,  generalizes,  and  extends  other  probabilistic  approaches  present  in 
the  literature,  and  argue  that  it  provides  the  foundations  of  a  statistical  theory  of  random  graphs. 

A  major  part  of  the  effort  is  devoted  to  the  analysis  of  modeling  issues  related  to  the  four 
essential  aspects  listed  above,  in  the  context  of  applications  to  social  and  biological  networks.  I 
also  investigate  theoretical  and  computational  issues  such  as  the  geometrical  intuition  of  the  latent 
allocation  task — an  important  inference  objective  shared  by  the  various  models  encompassed  by 
this  framework. 
Chapter 1


Introduction 


This  thesis  provides  a  methodological  framework  for  the  statistical  analysis  of  complex  graphs  and 
dynamic  networks.1  In  it,  I  develop  probabilistic  algorithms  that  generate,  evolve  and  integrate 
a  heterogeneous  collection  of  graphs,  I  study  the  statistical  models  these  algorithms  implicitly 
specify,  and  I  develop  strategies  for  estimating  the  set  of  quantities  on  which  they  depend  in  the 
context  of  applications  to  social  and  biological  networks. 


1.1  Complex  Data 

My investigations concern a population of objects of study. Objects can be divided into a few different
categories, or types, e.g., genes, proteins, and stable protein complexes; or documents, words
and references; or agents, tasks, and resources. Observations consist of measurements taken on
individual objects, i.e., attributes, and on pairs of objects, i.e., relations. Both attributes and relations
are typically multivariate, e.g., the expression of a gene under many experimental conditions,
or the set of words contained in a document. Measurements are taken over time, and distributed
across heterogeneous databases.

1The terms graph and network, without qualifications, are synonyms for the purposes of this thesis because
throughout I represent networks in terms of graphs.

At  any  given  epoch,  each  object  is  represented  as  a  node  in  a  graph.  Relations  are  represented 
as edges in the graph, among nodes (i.e., objects) of the same type, whereas attributes are represented
as edges in the graph, among nodes of different types. From a statistical perspective,
it  is  sometimes  convenient  to  consider  the  random  matrices  corresponding  to  the  various  graphs 
at  hand;  that  is,  the  adjacency  matrices  whose  elements  are  scalar  random  variables  that  encode 
edge-weights.  This  is  the  perspective  I  favor  throughout  this  thesis. 

Example  1.  Figure  1.1  shows  a  collection  of  heterogeneous  observations  that  we  may  use  to  gain 
insight  into  the  biology,  say,  of  yeast.  The  collection  involves  four  different  object  types:  proteins, 
genes, experimental conditions, and functional annotations. Relations are defined as measurements
on pairs of objects of the same type, i.e., the matrices PP, GG, CC, and AA. Attributes
are  defined  as  measurements  on  pairs  of  objects  of  different  types,  e.g.,  entries  of  the  matrix  GC 
measure  the  expression  of  genes  under  the  various  experimental  conditions. 
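
The distinction between relations and attributes drawn in Example 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the representation used in the thesis: the object counts, the random entries, and the helper `kind` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes for three object types (not taken from the thesis).
sizes = {"protein": 4, "gene": 3, "condition": 5}

# Each matrix is stored under the pair (row type, column type).
data = {}

# PP: a relation among proteins, stored as a symmetric 0/1 matrix.
upper = np.triu(rng.integers(0, 2, size=(sizes["protein"], sizes["protein"])), k=1)
data[("protein", "protein")] = (upper + upper.T).astype(float)

# GC: an attribute matrix, e.g. expression of each gene under each condition.
data[("gene", "condition")] = rng.normal(size=(sizes["gene"], sizes["condition"]))


def kind(key):
    """Relations pair objects of the same type; attributes pair different types."""
    row_type, col_type = key
    return "relation" if row_type == col_type else "attribute"


labels = {key: kind(key) for key in data}
```

Keying the matrices by type pairs keeps the unipartite and bipartite cases in one container, which is convenient when some of the matrices are unobserved.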

Such an integrated view of the data available for a given scientific problem invites us to think
about  the  semantics  and  the  substance  of  the  relations  among  object  types  in  the  context  of  the 
application  at  hand,  independently  of  whether  or  not  observations  about  them  are  available.  This 
process  is  beneficial  as  it  is  often  suggestive  of  new  research  directions. 

Example  2.  Proteins  are  composed  of  one  or  more  subunits.  In  turn,  each  subunit  is  composed 
of  one  or  more  linear  polypeptide  molecules,  which  are  polymers  of  twenty  different  amino  acids, 
i.e., residues. Amino acids can be modified once they have been incorporated into a polypeptide
and the presence of these modifications may have a strong influence on the functionality of the
final protein molecule. Such modifications are called post-translational (Alberts et al., 2002). The
matrix PG in Figure 1.1 encodes the mapping between proteins and gene tags after translation;
the matrix GG encodes correlations between microarray expression profiles of gene pairs. Should

[Figure 1.1 diagram: node sets for Proteins, Genes, Exp. Conditions, and Functional Annotations; matrices PP, PG, and GG.]

Figure  1.1:  An  example  of  complex  data.  The  figure  shows  an  integrated  (but  partial)  view  of 
the  observations  about  a  biological  system.  The  types  of  objects  are  proteins,  genes,  experimental 
conditions,  and  functional  annotations.  The  various  rectangles  are  suggestive  of  the  matrices  that 
encode  the  edge  weights  of  the  networks  (unipartite  and  bipartite)  among  pairs  of  nodes  of  the 
corresponding  types. 


post-translational  modifications  be  negligible,  we  should  see  a  one-to-one  mapping  between  genes 
and  proteins.  Figure  1.1  suggests  a  way  to  get  at  such  a  mapping,  which  provides  an  alternative 
to what has been proposed in the literature (Tsur et al., 2005). That is, we could estimate the




1.1.  COMPLEX  DATA 


E.M.  AIROLDI 


mapping between proteins and gene tags by looking at the consistency of the interactions between
stable protein complexes underlying the protein-to-protein network encoded in the matrix PP and
the interactions underlying the gene-to-gene network encoded in the matrix GG.

Other  instances  of  complex  data  arise  in  diverse  applications  such  as  artificial  intelligence 
(Carley,  2002),  biology  (Troyanskaya  et  al.,  2003;  Airoldi  and  Xing,  2006b),  information  retrieval 
(Barnard  et  al.,  2003),  natural  language  processing  (Griffiths  et  al.,  2005;  Kontorovich  et  al.,  2006), 
and  statistical  network  analysis  (Airoldi  et  al.,  2007a). 

Example 3. The analysis of large collections of scientific publications involves a different set
of object types: authors, documents, words and references (treated as documents' attributes).
Nonetheless, the data can be represented with a similar set of matrices, e.g., documents-to-words,
documents-to-references, and authors-to-authors. Depending on the availability of data
and  on  the  scientific  questions  of  interest,  researchers  typically  focus  on  one,  or  at  most  a  few,  of 
such  matrices  (Erosheva  et  al.,  2004;  Airoldi  et  al.,  2006e). 

Example 4. The analysis of a dynamic communication network involves object types such as employees,
emails and words. The data can be represented with a set of matrices very similar to those
in Example 3. If the analysis takes place within a corporate environment, we may involve more
object types such as tasks and resources. The matrices involving these new types, e.g., employees-to-tasks,
employees-to-resources, tasks-to-tasks, and tasks-to-resources, would enrich our representation
of the inner workings of the company, and allow us to ask a different, possibly more interesting,
set of questions (Carley, 2002).

1.1.1  Abstract  Representations 

From  a  modeling  perspective,  it  is  useful  to  complement  the  discussion  above  with  an  overview 
of  the  data,  and  how  they  are  represented.  Consider  a  collection  of  relations  measured  on  pairs  of 
nodes, and a collection of attributes measured on the same set of nodes. Such collections can be
represented by a unipartite graph and by a bipartite graph, respectively. I choose notation that is
suggestive of the elements of the random matrices that encode the edge weights in these graphs,
rather than the more standard notation based on vertices, edges, and a map from edges to edge
weights. A matrix representation of such a pair of unipartite and bipartite graphs is given, for
example, by the matrices PP and PG in Figure 1.1. Relations correspond to edges in the unipartite
graph PP, and connect pairs of objects of the same type, e.g., proteins, the only type of objects
in PP. Attributes correspond to edges in the bipartite graph PG, and connect pairs of objects of
different types, e.g., proteins and gene-tags. The set of attainable values of relation and attribute
measurements is application specific.2

For example, a collection of relations is denoted by

$$G_{1:R} = \{ G_r : r = 1, \dots, R \},$$

where the index $r$ runs over $R$ replicates. Each unipartite graph $G_r = (Y_r, \mathcal{N})$ is defined by a set
of edge weights, $Y_r$, over a fixed set of nodes, $\mathcal{N}$, e.g., the proteins in PP. The random quantities
that encode the edge weights between pairs of nodes $(n, m)$ in $\mathcal{N} \otimes \mathcal{N}$ are denoted by $y_r(n, m)$,
and take values in a separable, metric space.3 I refer to $Y_r$ as a random matrix whenever $Y_r(n, m)$
takes values in $\mathbb{R}$ for all edges in $\mathcal{E}$. A collection of attributes is denoted by

$$H_{1:R} = \{ H_r : r = 1, \dots, R \},$$

where the index $r$ runs over $R$ replicates. Each bipartite graph $H_r = (X_r, \mathcal{N}_{1:2})$ is defined by a set
of edge weights, $X_r$, over two fixed sets of nodes of different types, $\mathcal{N}_1$ and $\mathcal{N}_2$, e.g., the proteins
and the gene-tags in PA. The random quantities that encode the edge weights between pairs of
heterogeneous nodes $(n, m)$ in $\mathcal{N}_1 \otimes \mathcal{N}_2$ are denoted by $x_r(n, m)$, and take values in a separable,
metric space. When dealing with both unipartite and bipartite graphs for which $\mathcal{N} = \mathcal{N}_1$, e.g.,
as is the case for PP and PA, it is sometimes convenient to denote the set of attributes in $\mathcal{N}_2$ by a
collection of node-specific random quantities $x^r_{1:N}(m)$, where $N$ is the number of nodes in $\mathcal{N}$, and
$m$ is one of the $M$ distinct attributes in $\mathcal{N}_2$; the replicate index $r$ has been moved to the exponent
for clarity.

2Always a separable (i.e., contains a countable, dense subset) metric space.

3It is possible to introduce the set of edges, and define $Y_r$ as a mapping from edges to edge weights, an unnecessary
complication at this stage.
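
As a rough illustration of this notation, the replicated relations and attributes can be held in three-dimensional arrays. The sketch below assumes real-valued edge weights; the sizes, the random entries, and the helper `node_attributes` are hypothetical and only mirror the indexing conventions introduced above.

```python
import numpy as np

R, N, M = 3, 4, 2  # hypothetical numbers of replicates, nodes, and attributes
rng = np.random.default_rng(1)

# Y[r, n, m] plays the role of y_r(n, m): edge weights of the r-th unipartite graph.
Y = rng.random(size=(R, N, N))

# X[r, n, m] plays the role of x_r(n, m): edge weights of the r-th bipartite graph.
X = rng.normal(size=(R, N, M))


def node_attributes(X, r, m):
    """Return the node-specific quantities for attribute m in replicate r."""
    return X[r, :, m]


first_attribute = node_attributes(X, r=0, m=0)
```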


1.2  Goals  of  the  Analysis 

I  distinguish  two  main  types  of  analyses:  descriptive  versus  predictive.  In  a  descriptive  analysis  the 
goal  is  to  find  a  model  that  captures  the  variability  of  the  observations  with  high  probability — in 
terms  of  the  estimates  for  the  underlying  constants,  and  in  terms  of  the  inferred  distributions  over 
the  latent  quantities  involved.  In  a  predictive  analysis  the  goal  is  to  find  a  model  that  is  good  at 
predicting  a  specific  set  of  attributes  or  relations  from  another  set  of  attributes  or  relations.  The 
ability of such a model to replicate the variability of the observations may be sacrificed in this
case, since estimates and distributions that assign high probability to the data do not necessarily lead
to  good  predictions.  The  divergence  of  objectives  between  descriptive  and  predictive  analyses 
is  analogous  to  that  between  the  probabilistic  versions  of  principal  component  analysis  (Jolliffe, 
1986;  Tipping  and  Bishop,  1999)  and  Fisher’s  linear  discriminant  analysis  (Fisher,  1936,  1938; 
Ripley,  1996). 

The models I consider in this thesis are slightly more complex. They posit a hierarchy of probabilistic
assumptions on the way observables, $(Y, X)$, and non-observables, $S$, related to objects of
different types are generated, and depend upon and interact with one another. Given these assumptions,
the models summarize the complexity of the observations in terms of a set of latent patterns.
Patterns are defined in terms of a set of parameters, $\Theta$, which are also non-observable but which
are semantically distinct from $S$. Small sets of underlying constants, e.g., $A$ and $B$, sit at the top
of the hierarchy, constrain the space of non-observable quantities, $(S, \Theta)$, and ultimately constrain
the likelihood of the observations,


(1.1) 


Likely patterns, $\Theta$, and likely values of other non-observable quantities, $S$, are searched for, and
found, in the data. They may be used for organizing and simplifying complex information, determining
object similarity, detecting outliers, and making predictions about attributes of, and relations
among, the objects involved.

The  analyses  supported  by  these  models  boil  down  to  a  subset  of  four  fundamental  tasks:  (1) 
allocation,  that  is,  the  search  for  a  likely  mapping  of  objects  to  patterns;  (2)  estimation,  that  is, 
the  search  for  likely  values  of  the  underlying  constants;  (3)  inference,  that  is,  the  search  for  likely 
values  of  patterns  and  other  non-observables;  (4)  prediction,  that  is,  the  search  for  likely  values  of 
attributes and relations that need to be predicted. For example, testing a hypothesis about the existence
of a specific pattern, $\theta_0 \in \Theta$, can be carried out by building, e.g., a plug-in confidence region
$\mathcal{R} = \mathcal{R}(\Theta)$ for $\Theta$, and checking whether $\theta_0$ belongs to it. The tasks relevant to a specific analysis can
be  carried  out  simultaneously,  since  the  relevant  quantities,  i.e.,  observables,  non-observables,  and 
underlying  constants,  are  tied  together  in  a  hierarchy  of  probabilistic  assumptions  by  the  generative 
algorithm. 

Example 5. Consider the output of a battery of microarray experiments on the same set of genes,
$\mathcal{N}$, under $R$ different experimental conditions, in yeast (Krogan et al., 2006). Proteins are uniquely
identified by genes in the microarray experiments. Without entering into biological details, I wish
to analyze probabilities of interactions between pairs of proteins, which are induced from correlations
found in the gene expression experiments (Bhardwaj and Lu, 2005). Information about
this  unique,  symmetric  relation  can  be  stored  in  a  collection  of  R  square  tables,  one  for  each 
experimental  condition,  whose  entries  are  random  variables  with  support  [0, 1]  that  encode  the 
probability  of  an  interaction  between  corresponding  pairs  of  proteins.  The  analysis  of  the  set  of 
protein-protein interactions aims primarily at identifying stable protein complexes, i.e., clusters of
proteins, since they have been shown to be important for carrying out cellular processes. Further,
the number of protein complexes that are needed to explain the collection of protein interactions
needs to be identified. Lastly, the probabilities according to which pairs of such protein complexes
interact with one another need to be estimated.

An aspect of the methodology that is relevant to the discussion here is the presence in the proposed
models of non-observables, $S$, which encode semantic elements of interest in a specific application,
e.g., the stable protein complexes of Example 5. This implies that such non-observables
are potentially measurable, and, typically, few measurements about them are available, or can be
made available at a cost. Special attention is given in the analyses to such latent quantities, and to
other attributes or relations that are measured with error (e.g., experimental evidence or human
annotators disagree on their values) or on only a small portion of the objects of study (e.g., because
they are expensive to obtain). I will often refer to partially available measurements about such attributes,
relations and non-observables as labels. The portion of labeled data available is of interest for the estimation of
the prediction error, and an explicit error model for the labels is often desirable. Further, depending
on the amount of labeled data available, different strategies for initializing the inference,4 for fitting
the underlying constants, and for inferring the distributions on the latent quantities given the data
may be adopted.

Example 6. Consider the set of hand-curated protein interactions produced by the Munich Information
Center for Protein Sequences (Mewes et al., 2004). A single set of interactions between proteins has
been experimentally verified. Information about this unique, symmetric relation can be stored in
one square table, whose entries are random variables with support {0, 1} that encode presence or
absence of an interaction between corresponding pairs of proteins.

4Differences that have important consequences on the interpretability of the estimates and of the inferred distributions
on the latent quantities.


1.3  Basic  Modeling  Elements 

There are a few central modeling ideas that inform the probabilistic algorithms presented in the following
chapters. These ideas generalize model specifications that were used to gain insight into
fundamental problems of computational biology, i.e., serial analysis of gene expression (Airoldi et al.,
2006f) and protein interaction networks (Airoldi et al., 2006c), and into the analyses of large collections
of scientific publications (Airoldi et al., 2006e) and of dynamic communication networks
(Airoldi and Faloutsos, 2004; Airoldi et al., 2005d). They relate to the following four aspects of
complex data: (1) the presence of a hierarchical structure in the likelihood, which includes both observable
and non-observable random quantities, (2) the mixed membership assumption, according
to which objects may participate in multiple latent patterns to different degrees, (3) the temporal
dimension, and (4) the existence of multiple data types, and conditional dependencies among their
distributions, in an integrated system.

These  aspects  are  best  illustrated  below  by  discussing  how  they  generalize  popular  data  analysis 
models  such  as  probabilistic  principal  component  analysis  (PPCA),  factor  analysis  (FA),  and  state- 
space  models  (SSM). 

1.3.1  Hierarchy  and  Latent  Patterns 

Let us consider a collection of attributes, $H = (X, \mathcal{N}_{1:2})$, and let us adopt the point of view of
multivariate attribute measurements, $x_{1:N}$, on the $N$ objects in $\mathcal{N}_1$ about the $M$ objects in $\mathcal{N}_2$.

Example 7. The data generating process for $X$ underlying factor analysis is instantiated by a
probabilistic algorithm, $\mathcal{A}_1 : (\mathcal{N}_{1:2}, K, \Lambda, \Psi) \to \mathbb{R}^{\mathcal{N}}$,

1. For each object $n \in \mathcal{N}_1$

1.1. Sample the latent factors $\phi_n \sim \text{Normal}_K(0, I)$

1.2. Sample the error $\epsilon_n \sim \text{Normal}_M(0, \Psi)$

1.3. Define the multivariate attribute $x_n = \Lambda \phi_n + \epsilon_n$,

where $K$ is typically referred to as the number of (scalar) factors, $\Lambda$ is a deterministic matrix of
factor loadings, and $\Psi$ is an unconstrained variance-covariance matrix.5 The algorithm suggests
a hierarchical decomposition of the joint probability distribution of the attributes, $X = x_{1:N}$, and
the latent factors, $\Theta = (\phi_{1:N}, \epsilon_{1:N})$, given a set of underlying constants, $A = (\Lambda, \Psi)$; that is,
the integrand in Equation 1.2. By integrating the latent variables out of the joint we obtain the
likelihood of the observations,


(1.2) 


where $f_1$ and $f_2$ are $K$- and $M$-dimensional Gaussian densities, respectively.
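
The generative steps of Example 7 can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the thesis: the function name, the loadings `Lam`, and the noise covariance `Psi` are hypothetical choices. The final check relies on the standard fact that, once the factors are integrated out, each attribute vector is Gaussian with covariance $\Lambda \Lambda^\top + \Psi$.

```python
import numpy as np


def sample_factor_analysis(N, loadings, noise_cov, rng):
    """Draw N attribute vectors x_n = loadings @ phi_n + eps_n."""
    M, K = loadings.shape
    phi = rng.standard_normal(size=(N, K))                          # step 1.1
    eps = rng.multivariate_normal(np.zeros(M), noise_cov, size=N)   # step 1.2
    X = phi @ loadings.T + eps                                      # step 1.3
    return X, phi


rng = np.random.default_rng(0)
Lam = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]])  # hypothetical, M = 3, K = 2
Psi = np.diag([0.1, 0.2, 0.3])                        # hypothetical noise covariance
X, phi = sample_factor_analysis(50_000, Lam, Psi, rng)

# Compare the empirical covariance with the one implied by the model.
implied_cov = Lam @ Lam.T + Psi
empirical_cov = np.cov(X, rowvar=False)
max_gap = np.abs(empirical_cov - implied_cov).max()
```

With a sample of this size the largest entry-wise gap between the two covariance matrices should be small, which is a quick sanity check on the sampler.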

In FA the latent factors are an example of patterns, the way I intend them; they are non-observable
random quantities, defined in terms of a set of scalar parameters. Depending on the
model,  patterns  may  specify  other  mathematical  objects  such  as  probability  distributions,  smooth 
curves,  and  surfaces. 

Confusion may arise about the notation for patterns, $\Theta$, and underlying constants, $A$, in those
cases where latent patterns are defined to be deterministic. In such cases the patterns would occupy
a spot at the top of the hierarchy, similarly to the underlying constants, thus leaving us the choice to
include $\Theta$ in $A$ or not.6 I shall clarify the use of notation whenever the occasion requires it. Further,
I note that the latent patterns, $\Theta$, and other non-observable random quantities with a semantic
interpretation, $S$, that appear in the general formulation of Section 1.2, are to be interpreted as
part of a hierarchical likelihood since they typically model substantive elements of interest in the
application at hand.

5Note that PPCA differs from FA only in that the variance-covariance matrix of the errors, $\epsilon_{1:N}$, is homoscedastic,
that is, $\Psi = \sigma^2 I$ with $\sigma$ scalar.

Example 8. A simple mixture of spherical Gaussians for $x_{1:N}$ can be specified by the following
probabilistic algorithm, $\mathcal{A}_2 : (\mathcal{N}_{1:2}, K, \mu_{1:K}, \Sigma_{1:K}) \to$

1. For each object $n \in \mathcal{N}_1$

1.1. Sample the latent component indicator $i_n \sim \text{Uniform}(1, \dots, K)$

1.2. Sample the multivariate attribute $x_n \sim \text{Normal}_M(\mu_{i_n}, \Sigma_{i_n})$,

where $K$ is the number of mixture components, $(\mu_{1:K}, \Sigma_{1:K})$ are the corresponding mean vectors
and variance-covariance matrices, and $\Sigma_k = \sigma_k^2 I$ with $\sigma_k$ scalar. The likelihood can be written as
in Equation 1.2, where the attributes $x_{1:N}$ are denoted by $X$, the latent component indicators $i_{1:N}$
by $\Theta$, and the underlying constants $(\mu_{1:K}, \Sigma_{1:K})$ by $A$. In this case $f_1$ and $f_2$ are discrete
uniform and $M$-dimensional normal densities, respectively.
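
A minimal Python sketch of the mixture in Example 8 follows. It assumes equal component weights, as the uniform indicator implies, and uses 0-based component labels; the function name and the toy means and scales are hypothetical.

```python
import numpy as np


def sample_spherical_mixture(N, means, sigmas, rng):
    """Draw N points from an equal-weight mixture of spherical Gaussians."""
    K, M = means.shape
    z = rng.integers(0, K, size=N)            # step 1.1: uniform indicator (0-based)
    noise = rng.standard_normal(size=(N, M))
    X = means[z] + sigmas[z, None] * noise    # step 1.2: Normal_M(mu_z, sigma_z^2 I)
    return X, z


rng = np.random.default_rng(0)
means = np.array([[-5.0, 0.0], [5.0, 0.0], [0.0, 8.0]])  # hypothetical, K = 3, M = 2
sigmas = np.array([0.5, 1.0, 0.2])                       # one scalar scale per component
X, z = sample_spherical_mixture(3000, means, sigmas, rng)

# Within each component, the sample mean should sit close to the component mean.
component_means = np.array([X[z == k].mean(axis=0) for k in range(3)])
```

Note that every coordinate of a given $x_n$ shares the same indicator here, which is exactly the single-membership constraint that mixed membership relaxes.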

In the example above, the underlying constants $(\mu_{1:K}, \Sigma_{1:K})$ qualify as patterns. It is conceivable
to put probabilistic constraints on such quantities, e.g., by assuming that they are generated
from some distributions. By doing so, I would introduce a new, more parsimonious set of underlying
constants, $A$, and promote $\Theta = (\mu_{1:K}, \Sigma_{1:K})$ to be the non-observable, probabilistic patterns
of the general formulation of Section 1.2.

6It could be argued, for instance, that the matrix of factor loadings, $\Lambda$, should be considered a part of the patterns
underlying a set of attribute measurements, $X$, as much as the factors, $\Theta$, themselves. However, in the hierarchical
formulation I consider here, it is not difficult to imagine the use of a probabilistic model for $\Lambda$ to endow the loadings
with some desirable property (Airoldi and Lin, 2006).
Thus a generative algorithm and the corresponding hierarchical likelihood specify exactly how
the various quantities of interest interact, in a probabilistic fashion, and encode structural hypotheses
of the scientist.

1.3.2  Mixed  Membership 

The  idea  of  mixed  membership  extends  that  of  a  mixture.  Stated  briefly,  this  assumption  posits 
that  the  collection  of  measurements  involving  an  object,  i.e.,  both  relations  and  attributes,  may 
be  ultimately  explained  in  terms  of  multiple  patterns  to  different  degrees.  A  recurring  element  of 
my  models  is  that  such  representations  of  latent  patterns  are  associated  with  the  components  of 
a  mixture,  as  in  the  example  above.  In  this  sense,  both  mixture  models  and  mixed-membership 
models  aim  at  describing  the  aggregate  variability  of  a  set  of  measurements  in  terms  of  a  small  set 
of latent patterns. There are two salient differences, however, between a mixture model and
a  mixed  membership  model. 

(i)  In  a  mixture  model  the  membership  of  objects  to  patterns  is  specified  in  terms  of  global 
weights.  In  a  mixed-membership  model  the  membership  of  objects  to  patterns  is  specified  in 
terms  of  object-specific  weights;  these  give  a  low-dimensional  representation  of  the  objects 
that  can  be  used  for,  e.g.,  making  predictions  about  object-specific  quantities. 

(ii) Measurements in a mixed membership model can be associated with more than one latent
pattern. The role of sparsity in this context is to impose soft constraints in the estimation of
the mapping between objects and latent patterns; I term this estimation the allocation task.

For instance, relations or attributes of an employee in Example 4 may be explained in terms of
the latent patterns associated with more than one social group, and the interactions between two
individual proteins can be explained by their taking part in more than one stable protein complex.
Example 8 provides us with another element of the general formulation of Section 1.2, that is,
the set of non-observables that encodes semantic elements of interest. For instance, in the application
to dynamic networks described in Example 4, or in the application to gene co-expression
networks in Airoldi and Xing (2006b), the Gaussian mixture components are suggestive of latent
social groups and latent stable protein complexes, respectively. In these specific applications,
the non-observables $S_{1:N}$ encode the single membership to a social group, or to a stable protein
complex, whereas the latent patterns $\Theta$ provide parametric representations of social groups and
complexes.

Example 9. Going back to the previous example, the $M$ scalar components of each multivariate
attribute $x_n$ are no longer constrained to be sampled from the same latent pattern, i.e., from the
same mixture component, $(\mu_k, \Sigma_k)$. The new algorithm, $\mathcal{A}_3$, which instantiates the mixed membership
of $K$ spherical Gaussians is as follows,

1. For each object $n \in \mathcal{N}_1$

1.1. For each attribute $m \in \mathcal{N}_2$

1.1.1. Sample the latent component indicator $i_n \sim \text{Uniform}(1, \dots, K)$

1.1.2. Sample the scalar attribute $x_n(m) \sim \text{Normal}(\mu_{i_n}(m), \sigma_{i_n}(m))$.
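
The algorithm of Example 9 can be sketched in Python as below. The sketch makes two assumptions that the text does not settle: the second argument of the normal distribution is treated as a standard deviation, and the scales are stored per component and per attribute. The function name and toy values are hypothetical.

```python
import numpy as np


def sample_mixed_membership(N, means, sigmas, rng):
    """Draw N objects whose M scalar attributes each pick their own component."""
    K, M = means.shape
    z = rng.integers(0, K, size=(N, M))   # step 1.1.1: one indicator per (n, m)
    cols = np.arange(M)
    mu = means[z, cols]                   # mean of the chosen component, attribute m
    sd = sigmas[z, cols]                  # scale of the chosen component, attribute m
    X = mu + sd * rng.standard_normal(size=(N, M))  # step 1.1.2
    return X, z


rng = np.random.default_rng(0)
means = np.array([[-5.0, -5.0, -5.0], [5.0, 5.0, 5.0]])  # hypothetical, K = 2, M = 3
sigmas = np.full((2, 3), 0.5)
X, z = sample_mixed_membership(1000, means, sigmas, rng)

# Fraction of objects whose attributes come from more than one component.
mixed_fraction = np.mean(z.min(axis=1) != z.max(axis=1))
```

With two components and three attributes, an object draws all of its attributes from a single component with probability $2 \cdot (1/2)^3 = 1/4$, so roughly three quarters of the objects are expected to show mixed membership.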

Alternatively,  it  is  possible  to  illustrate  this  richer  mapping  between  observables  measured  on  an 
object  and  mixture  components  entailed  by  the  mixed  membership  assumption  in  terms  of  a  general 
form  for  the  likelihood.  That  is,  we  can  rewrite  the  likelihood  as  an  admixture  for  each  univariate 
measurement.  The  mixture  likelihood  in  Equation  1.2  can  be  specified  as, 

$$\mathcal{L}(X \mid A) = \prod_{n} \int f_1(\Theta \mid A)\, f_2(x_n \mid \Theta, A)\, d\Theta, \tag{1.3}$$

whereas the admixture likelihood corresponding to Algorithm $\mathcal{A}_3$ can be rewritten as,

$$\mathcal{L}(X \mid A) = \prod_{n,m} \int f_1(\Theta \mid A)\, f_2(x_n(m) \mid \Theta, A)\, d\Theta, \tag{1.4}$$

where $f_1$ and $f_2$ are Gaussian densities of appropriate dimensionality. Note that this is also the
case for PPCA, but not for FA, because of possible structure in $\Psi$. In PPCA, however, the space
of multivariate attributes is not restricted to the convex cone spanned by the $K$ non-observable quantities,
because the factor loadings do not lie in the $K$-dimensional simplex; this is the case with $f_1$,
which is a probability distribution on the $K$ latent patterns, $\Theta_{1:K}$.
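
To make the contrast between the two likelihoods concrete, the sketch below evaluates both for the discrete case of Examples 8 and 9, where the latent quantity is a uniform component indicator and the integral reduces to a sum over $K$ components. This is an illustration under that assumption, with hypothetical function names and toy values; scales are treated as standard deviations.

```python
import numpy as np


def log_normal_pdf(x, mu, sd):
    """Log density of a univariate normal with mean mu and standard deviation sd."""
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((x - mu) / sd) ** 2


def mixture_loglik(X, means, sds):
    """One indicator per object: sum over components of a product over attributes."""
    lp = log_normal_pdf(X[:, None, :], means[None], sds[None])   # shape (N, K, M)
    per_component = lp.sum(axis=2) - np.log(means.shape[0])      # shape (N, K)
    return np.logaddexp.reduce(per_component, axis=1).sum()


def admixture_loglik(X, means, sds):
    """One indicator per measurement: product over (n, m) of a sum over components."""
    lp = log_normal_pdf(X[:, None, :], means[None], sds[None]) - np.log(means.shape[0])
    return np.logaddexp.reduce(lp, axis=1).sum()


means = np.array([[-5.0, -5.0], [5.0, 5.0]])  # hypothetical, K = 2, M = 2
sds = np.full((2, 2), 0.5)

# An object whose two attributes look like they come from different components.
x_mixed = np.array([[-5.0, 5.0]])
ll_mixture = mixture_loglik(x_mixed, means, sds)
ll_admixture = admixture_loglik(x_mixed, means, sds)
```

For `x_mixed` the admixture likelihood is far larger than the mixture likelihood, because no single component explains both coordinates. When $M = 1$ the two expressions coincide.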


A  final  note  concerns  the  mapping  between  observations  and  mixture  components.  Application- 
specific  features  of  the  mapping  itself  are  typically  of  interest,  since  they  impact  the  latent  patterns 
found, and the results of the analysis more generally. One of the features often supported by
the  data  is  that  such  a  mapping  is  sparse;  that  is,  each  univariate  measurement  can  be  ultimately 
explained  in  terms  of  a  few  latent  patterns  (i.e.,  mixture  components).  Further,  in  many  applications 
the mapping is skewed; that is, many univariate measurements can be ultimately explained in
terms  of  a  few  patterns,  whereas  a  few  univariate  measurements  can  be  explained  in  terms  of  many 
of  them. 

Lastly, it is important to recognize that the choice between alternative specifications (e.g.,
parametric, semi-parametric, or ad-hoc) for the number of non-observable quantities, $K$, in these
models is not a matter of mathematical elegance. Such a choice typically has a non-negligible
impact on the substantive findings and their interpretation. It should therefore be motivated and
discussed in terms of the specific scientific problem of interest, the amount of information
available about such non-observable quantities, and the goals of the analysis (e.g., exploratory
versus conclusive). For example, Krogan et al. (2006) find that the average size of stable protein
complexes is about five proteins. That suggests the existence of a larger number of such
non-observable complexes, prior to any analysis of a new set of proteins, $\mathcal{P}$, than popular
semi-parametric model specifications would entail; that is, $O(|\mathcal{P}|)$ rather than
$O(\log |\mathcal{P}|)$.
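The contrast between the two growth rates can be illustrated with a minimal sketch. It assumes, which the text does not state, that the semi-parametric specification is a Dirichlet process prior with concentration `alpha`; under that assumption the expected number of occupied components among n objects is the sum of alpha / (alpha + i) for i = 0, ..., n - 1, which grows logarithmically in n. The linear count simply divides the number of proteins by the average complex size of five and ignores overlap between complexes. Function names are illustrative.

```python
def expected_dp_components(n, alpha=1.0):
    """Expected number of occupied components among n objects under a
    Dirichlet process prior (an assumed stand-in for the
    semi-parametric specification)."""
    return sum(alpha / (alpha + i) for i in range(n))


def complexes_from_average_size(n, avg_size=5.0):
    """Rough number of complexes if n proteins split into
    non-overlapping complexes of avg_size proteins each."""
    return n / avg_size


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, round(expected_dp_components(n), 1),
              complexes_from_average_size(n))
```

For a set of a few thousand proteins the two counts differ by orders of magnitude, which is the sense in which the prior on $K$ is a substantive modeling choice.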

1.3.3  Dynamics  and  Evolution 

Several models of dynamic behavior exist in the literature, which can be used to model the
evolution of latent patterns for a finite number of epochs, $\Theta^{(1:T)}$.

Example 10. A linear, Gaussian state-space model extends the factor analysis model of Example
7, by linearly evolving the latent factors from one epoch to the next. The data generating process
for $X^{(1:T)}$ is as follows,

1. At epoch $t = 0$

1.1. For each object $n \in \mathcal{N}$

1.1.1. Sample the latent factors $\theta_n^{(0)} \sim \text{Normal}_K(0, I)$

1.1.2. Sample the error $\epsilon_n^{(0)} \sim \text{Normal}_M(0, \Psi)$

1.1.3. Define the multivariate attribute $x_n^{(0)} = \Lambda \theta_n^{(0)} + \epsilon_n^{(0)}$

2. At epoch $0 < t \le T$

2.1. For each object $n \in \mathcal{N}$

2.2.1. Evolve the latent factors $\theta_n^{(t)} = F \theta_n^{(t-1)}$,

2.2.2. Sample the error $\epsilon_n^{(t)} \sim \text{Normal}_M(0, \Psi)$

2.2.3. Define the multivariate attribute $x_n^{(t)} = \Lambda \theta_n^{(t)} + \epsilon_n^{(t)}$,

where $F$ is a $(K \times K)$ matrix that encodes the dynamics of the latent factors. As before, the
algorithm suggests a hierarchical decomposition of the joint probability distribution of the
attributes, $X^{(1:T)} = x_{1:N}^{(1:T)}$, and the latent factors, $\Theta^{(1:T)} = \theta_{1:N}^{(1:T)}$,
given a set of underlying constants, $A = (F, \Lambda, \Psi)$, that does not change over time.7 The
likelihood is then,

$$
\ell(X^{(1:T)} \mid A) = \int f_1(\Theta^{(0)} \mid A)\, f_2(X^{(0)} \mid \Theta^{(0)}, A)
\times \left( \prod_{t=1}^{T} f_0(\Theta^{(t)} \mid \Theta^{(t-1)}, A)\, f_2(X^{(t)} \mid \Theta^{(t)}, A) \right) d\Theta^{(1:T)}, \qquad (1.5)
$$

where $f_1$ and $f_2$ are $K$- and $M$-dimensional Gaussian densities, respectively, and $f_0$ is the
deterministic transformation in Step 2.2.1. of the data generating process. A graphical
representation of FA and SSM is given in Figure 1.2, which highlights the simple connection between
the two models.

Figure 1.2: Graphical representations of a factor analysis model (left) and of a state-space model
for observations at two consecutive epochs (right). White nodes denote non-observables, whereas
shadowed nodes denote observables.
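The data generating process of Example 10 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an implementation taken from the thesis: the error covariance $\Psi$ is taken to be diagonal and is passed as a list of standard deviations, matrices are plain lists of rows, and the names `simulate_ssm` and `matvec` are hypothetical.

```python
import random


def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def simulate_ssm(loadings, dynamics, noise_sd, n_objects, n_epochs, seed=0):
    """Simulate epochs 0..n_epochs of the linear Gaussian state-space model.

    loadings: M x K matrix Lambda; dynamics: K x K matrix F;
    noise_sd: M standard deviations (diagonal Psi is an assumption).
    Returns (theta, x): theta[t][n] is a K-vector, x[t][n] an M-vector.
    """
    rng = random.Random(seed)
    k = len(dynamics)
    theta, x = [], []
    for t in range(n_epochs + 1):
        theta_t, x_t = [], []
        for n in range(n_objects):
            if t == 0:
                # Step 1.1.1: latent factors from Normal_K(0, I)
                factors = [rng.gauss(0.0, 1.0) for _ in range(k)]
            else:
                # Step 2.2.1: deterministic evolution theta(t) = F theta(t-1)
                factors = matvec(dynamics, theta[t - 1][n])
            # Steps 1.1.2 and 2.2.2: independent Gaussian errors
            errors = [rng.gauss(0.0, sd) for sd in noise_sd]
            # Steps 1.1.3 and 2.2.3: x = Lambda theta + error
            mean = matvec(loadings, factors)
            x_t.append([m + e for m, e in zip(mean, errors)])
            theta_t.append(factors)
        theta.append(theta_t)
        x.append(x_t)
    return theta, x


if __name__ == "__main__":
    loadings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # M = 3, K = 2
    dynamics = [[0.9, 0.0], [0.0, 0.5]]
    theta, x = simulate_ssm(loadings, dynamics, [0.1, 0.1, 0.1],
                            n_objects=4, n_epochs=3)
    print(x[3][0])
```

Because the evolution in Step 2.2.1 is deterministic, all randomness after epoch 0 enters through the errors; setting `n_epochs=0` recovers the factor analysis model.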

7 The dynamic matrix $F$ may be easily modeled as time dependent and/or stochastic, as the problem
requires (Airoldi and Faloutsos, 2004; Airoldi et al., 2005d).

In my models, I consider three flavors of dynamics:

(1) a generalized state-space process (Brockwell and Davis, 1991; Xing, 2005a), possibly
non-linear and non-Gaussian, which provides my models with a fully-parametric, tractable
dynamics;

(2) a latent birth-death process (Karr, 1991; Xing, 2005b; Airoldi and Xing, 2006a) that allows
for a possibly infinite number of patterns (semi-parametric) and generates complex pattern
dynamics; and

(3) a co-evolutionary process (Carley, 1990, 1991) that induces highly non-linear dynamics by
tying together the temporal behaviors of observables and non-observables.

Technical  issues  arise  with  the  increasing  complexity  of  the  dynamical  behaviors  above. 

As an alternative, it is possible to specify temporal patterns directly, as a part of $(\Xi, \Theta)$.
Such a modeling strategy allows us to consider sequences of observations about objects as being
expressed as an admixture of complicated patterns, specified in a parametric or non-parametric
fashion, while avoiding the technical issues that arise when the estimation of an explicit dynamics
is considered.

1.3.4  Integration 

Integration of the measurements on relations and attributes involving objects of different types
may, and will, take many forms in the models considered throughout this thesis, and it seems
unnecessary to list them all at this stage. It will suffice to distinguish two types of integration:
one relates to descriptive versus predictive analyses, and the other relates to the integration of
labels.

Example 11. Consider the following generative process for a set of relations $G = (Y, \mathcal{N})$
among objects in $\mathcal{N}$, a set of multivariate attributes $H_X = (X, \mathcal{N}, T)$ on the same
set of objects, and a set of labels $H_Z = (Z, \mathcal{N}, C)$.

1. Sample the mixed-membership scores for objects in $\mathcal{N}$ according to
$\pi_{1:N} \sim f_1(\pi \mid A)$

2. Sample the latent pattern indicators for object-specific relations and attributes,
independently and given their mixed-memberships, according to

$(I_Y, I_X) \sim f_2(I \mid \pi_{1:N})$

3. Sample the observations given object-specific patterns according to
$(Y, X) \sim f_3(Y, X \mid I_X, I_Y, \Theta, A)$

4. Sample predictive indicators for the objects' labels from the corresponding sets of
object-specific pattern indicators that were sampled, according to

$I_Z \sim f_4(I \mid I_Y, I_X)$

5. Sample the labels given the predictive indicators according to
$Z \sim \text{Generalized Linear Model}(Z \mid I_Z, A)$

where $\Xi = (\pi_{1:N}, I_Y, I_X, I_Z)$, and the latent patterns $\Theta$ are deterministic. Relevant to
the discussion here is the composition of the relations and attributes $(Y, X)$ as independent
sources of information in Steps 2-3, versus the composition of the labels $Z$ as a conditional source
of information in Steps 4-5.
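The five steps of Example 11 can be sketched as follows. The example leaves every density unspecified, so each choice here is an assumption made only for illustration: a symmetric Dirichlet for the membership scores, categorical draws for the indicators, Bernoulli edges governed by a hypothetical block matrix `block`, unit-variance Gaussian univariate attributes with pattern means `means`, a uniform draw among an object's own indicators for Step 4, and a logistic regression with coefficients `coefs` as the generalized linear model.

```python
import math
import random


def sample_example11(n_objects, block, means, coefs, alpha=1.0, seed=0):
    """Draw one data set following Steps 1-5 under the assumed densities."""
    rng = random.Random(seed)
    k = len(means)
    objects = range(n_objects)

    # Step 1: mixed-membership scores, pi_n ~ Dirichlet(alpha) (assumed)
    pi = []
    for _ in objects:
        g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(g)
        pi.append([v / total for v in g])

    def draw_pattern(n):
        return rng.choices(range(k), weights=pi[n])[0]

    # Step 2: one indicator per directed pair (I_Y) and per attribute (I_X)
    i_y = [[draw_pattern(n) for _ in objects] for n in objects]
    i_x = [draw_pattern(n) for n in objects]

    # Step 3: relations and attributes given the indicators
    y = [[int(n != m and rng.random() < block[i_y[n][m]][i_y[m][n]])
          for m in objects] for n in objects]
    x = [rng.gauss(means[i_x[n]], 1.0) for n in objects]

    # Step 4: predictive indicator drawn from the indicators already
    # sampled for the same object
    i_z = [rng.choice(i_y[n] + [i_x[n]]) for n in objects]

    # Step 5: labels from a logistic-regression GLM on the indicator
    z = [int(rng.random() < 1.0 / (1.0 + math.exp(-coefs[i_z[n]])))
         for n in objects]

    return {"pi": pi, "I_Y": i_y, "I_X": i_x, "I_Z": i_z,
            "Y": y, "X": x, "Z": z}


if __name__ == "__main__":
    data = sample_example11(5, [[0.9, 0.1], [0.1, 0.9]],
                            [-2.0, 2.0], [-3.0, 3.0])
    print(data["Z"])
```

In this sketch the relations and attributes are drawn side by side from the same membership scores, whereas the labels depend on the scores only through indicators that were already sampled, which mirrors the independent versus conditional composition described above.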

In a descriptive analysis, sets of non-observables corresponding to different data sources always
contribute equally to the data generation, and, in turn, observables always inform equally the
inference process about the corresponding sets of non-observables. This is what happens with the
relations and attributes $(Y, X)$ in Example 11 and with the corresponding latent pattern indicators
for relations and attributes $(I_Y, I_X)$. In a predictive analysis, one set of non-observables always
contributes to the data generation conditionally on the values assumed by a second set of
observables, and, in turn, the two sets of observables inform the inference process about
non-observables unequally; namely, the information the latter set contributes to the inference
process is used to describe residual variability, which cannot be explained by information
contributed by the former set of observables. This is what happens to the labels $Z$ in Example 11
with the corresponding latent pattern indicators $I_Z$.


1.4  Overview  of  the  Research 

Complexity of the observations is resolved into hierarchical mixtures of simple patterns that evolve
over time, i.e., complex global behavior emerges from the combination of local (i.e.,
object-specific) interaction patterns and dynamics. This solution provides a principled approach to
reconcile global properties of the system with local phenomena of interest. Structured models similar
to those shown in Figure 5.1 are often referred to as hierarchical models in the statistics
literature (Kass and Steffey, 1989; Gelman et al., 1995). Estimation techniques include empirical Bayes
(Morris, 1983; Carlin and Louis, 2005) and full Bayesian methods (Mosteller and Wallace, 1964,
1984; Airoldi et al., 2006a). The general model formulation I explore in this thesis subsumes many
probabilistic models present in the literature, provides a soft and probabilistic version of many
non-probabilistic algorithms, and most importantly provides the essential statistical elements for
the analysis of complex data, random graphs and matrices, and dynamic networks.

In Chapter 2, I survey existing algorithms to generate popular topologies in unipartite graphs. I
then present proper statistical models to generate such topologies, complete with likelihoods and
estimators for the parameters involved. I conclude by exploring the lognormal and cellular graphs.
In Chapter 3, I describe different ways to search for patterns and mechanisms underlying networks.
In Chapter 4, I consider attributes, I describe an extension of the models to multivariate attributes
and relations, and I describe strategies to integrate complex data into a large statistical model.
In Chapter 5, I describe models of evolution for attributes and relations. Finally, in Chapter 6, I
explore a selection of theoretical and computational issues associated with the general formulation
of my models and describe aspects of future research.

1.4.1 Contributions of this Thesis

This thesis develops statistical methodology for the Bayesian analysis of data that arise in studies
about complex networks and their evolution. The connection between modeling choices and
substantive issues is kept at the forefront of the discussion. Furthermore, complexity in the various
models is pursued only to the extent that it responds to needs that are rooted in the data and the
goals of the analysis. Such a focus on the data and their properties is compatible with the
development of a general modeling framework for the analysis of complex and evolving networks thanks
to the central role played by a few essential modeling elements, described in Section 1.3, that can
be used to describe complex dynamic systems in general.

1. In many applications there is a large amount of information available with a temporal or
sequential dimension. Methods that explicitly account for dynamics and evolution of the
phenomena under scrutiny are much needed. Modeling approaches available to date, for
which solid inference mechanisms are available, include hidden Markov models and
generalized state-space models. New methodology and modeling strategies are needed that can
account for richer evolutionary patterns of complex sets of measurements, i.e., relational and
non-relational. Furthermore, there is a need for producing predictions that are based on
several sources of information, which need to be integrated; a solid probabilistic approach to this
end is missing. This thesis develops a modeling framework that responds to these needs.

2. The models that can be specified working within the framework proposed in this thesis are
extremely diverse and widely applicable. Many scientific studies lead to data sets that are
represented as graphs, at some level, e.g., two-mode data lead to bipartite graphs, uni-modal
data to graphs where we record relations between pairs of objects of the same type,
multi-mode data to graphs where we record relations among objects of multiple types, multi-graphs
where edges encode multivariate variables, and combinations of these. Assumptions and intuitions
of interest may need to be incorporated in application-specific models, but the modularity of
my approach makes these special modeling issues a piece of the puzzle that can be addressed
separately, by instantiating one of the integration strategies of Section 4.3, on top of a set of
data source-specific models.

3. The proposed framework subsumes several models recently proposed in the machine
learning and applied statistics literature and ties them together within a general formulation that
is amenable to theoretical analysis. Therefore the proposed framework opens new analytical
possibilities by allowing us to address theoretical aspects of interest in terms of the
specifications of the general formulation. This high-level theoretical analysis disregards the nuances
present in application-specific models and focuses on fundamental technical issues, such as
identifiability, model selection, distribution-free tests for assessing the goodness of fit, the
geometrical understanding of allocation tasks, or the asymptotics of the family of
hierarchical mixed-membership models. Even a limited exploration of these issues would advance
the scientific understanding of the methodology. This exercise will ultimately benefit
applications by providing theoretical insights to support application-specific modeling choices.

The  grand  vision  is  to  establish  a  mature  statistical  theory  of  graphs  and  networks  that  can 
bridge  theoretical  computer  science,  a  largely  deterministic  discipline,  and  statistical  theory.  This 
can  be  achieved,  for  example,  by  explicitly  characterizing  the  relation  between  deterministic  and 
probabilistic  solutions  to  problems  shared  by  both  disciplines  that  involve  graphs  and  networks. 
The  ultimate  goal  is  that  of  promoting  the  role  of  statistical  Bayesian  theory  in  the  computing 
sciences  and  its  modern  applications. 

1. This thesis provides solid foundations of a statistical theory of mixed-membership and
exchangeable-edge models of graphs and networks and their evolution. Such foundations are
missing, to the best of my knowledge. This is a goal worth pursuing in its own right.

2. This thesis promotes the role of Bayesian statistics in the theoretical computer science and
data mining communities by providing new models and perspectives in applications of
primary importance, for example, to biological sequences and networks, dynamic social
networks, collections of scholarly publications, knowledge and corporate networks, and
homeland security. The proposed general framework aims at fostering scientific progress by
serving as a glue for several branches of the literature that are poorly aware of one another.

In conclusion, recent trends and events suggest an imminent shift of focus of the research
community at large towards complex interacting dynamic systems, along with a rediscovered mindset
that through integration we can finally deliver satisfactory solutions to long-standing real-world
problems, as well as create new applications. This thesis presents methodology derived from
applications for applications, and provides insights into when we can expect the methods we employ
to work, and why. Specifically, I discuss models and methods that enable
applications to biological databases, collections of scientific publications, and dynamic social and
corporate networks. Success stories where my methods were key to answering real-world problems
provide the background for the discussion. I argue that my efforts establish the foundations of a
statistical/computational theory of complex networks and their evolution.

1.4.2  Limitations 

This thesis develops a modeling framework to tackle specific applications. As a consequence, some
topics and modeling approaches that I believe are important for the analysis of complex and
evolving networks are omitted. I am currently working on an extension of the modeling framework
developed here that addresses such topics and modeling approaches. A short list of what is missing
in this thesis follows.

• Connections to statistical theory and methodology for the analysis of networks that does not
involve latent variables (Wasserman and Faust, 1994; Wasserman et al., 2007).

• A fully developed example of how the statistical methodology developed in this thesis
offers a principled approach to tackle calibration and validation issues that arise in large-scale
agent-based models and simulations of complex systems (Carley, 1990, 1991; Banks and Carley,
1996; Carley et al., 2006; Shreiber, 2006). However, I outline the main points of the
argument in Section 5.2.

• Connections to random matrix theory (Mehta, 2004). However, I devote Section 3.2 to
situating in the context of this thesis some of the recent developments in the field that bear
relevance to statistical network analysis; namely, the mathematical analysis of diffusion
(Coifman et al., 2005a, b; Nadler et al., 2005; Lafon and Lee, 2006).

• Generative models of edge patterns (a.k.a. network motifs, where node identity does not matter,
e.g., see Milo et al. 2002), as opposed to the generative models of node patterns (where node
identity matters, e.g., see Chapter 3) developed in this thesis. At the model level, such an
extension to Bayesian mixed membership models of edge patterns is trivial. However, non-trivial
computational issues arise immediately, for example, in the evaluation of the likelihood,
where a (combinatorial) search of all instances of the relevant edge motifs needs to be
performed, i.e., sub-graph matching.

• Models of complex dynamic behavior such as latent birth-death processes (Airoldi and Xing,
2006a, in preparation) and duplication-attachment processes (Wiuf et al., 2006).

• A complete analysis of the mathematical properties of the exchangeable-edge models of
Section 2.2. These models represent an important extension of the popular random graph
model of Erdős and Rényi (1959) and Gilbert (1959), technically, by involving a layer of
latent variables. Such an analysis is part of my current research (Airoldi and Carley, 2006;
Airoldi and Shalizi, 2006).


Chapter 2


Random  Graphs  Revisited 


Here I survey existing algorithms to generate popular topologies in unipartite graphs. Proper
statistical models to generate such topologies are then presented, complete with likelihoods. I
conclude with a presentation of novel probabilistic algorithms to generate lognormal and cellular
graph topologies, along with their analysis.


Introduction and Motivation. In order to shed some light on how the interactions among a set of
objects of study, e.g., people or proteins, lead to the emergence of observed patterns and properties
of interest, both local and global, e.g., groups and the small-world, several generative algorithms
have been proposed. These algorithms abstract a small set of essential features of the objects and
interactions, and try to replicate local or global patterns and properties of interest, either exactly
or approximately, either in a deterministic fashion or with high probability. We consider such
algorithms to be insightful whenever they can replicate the observable phenomena of interest, and
the small set of essential features which they are based upon suggests a possible explanation for
them.

Example 12. Milgram (1967) provided empirical evidence in support of the so-called small world
hypothesis. Briefly, Milgram instructed a set of people (i.e., sources) in Nebraska, Kansas and
Massachusetts to send packets to either of two specific individuals (i.e., targets) in
Massachusetts. The targets were described approximately in terms of a small set of characteristics
such as location, profession, and other demographics. Each source was supposed to move the packet
towards the target by sending it to a person they knew on a first-name basis, i.e., to an
acquaintance the source believed to be closer to the target. The game consisted in delivering the
packet to the target with as few of these first-name links as possible. If the small world
hypothesis held, the average length of the first-name chains of acquaintances that connected a
source to a target should be independent of the location of the sources. This is exactly what
Milgram found, the median length being somewhere around six, although an independent statistical
analysis of Milgram's data that includes incomplete chains suggests six to be a serious
underestimate of the actual median length (Fienberg and Lee, 1975). In an abstract setting, we can
represent people by means of nodes in a graph, and acquaintances by directed links from one node to
another. The scientific phenomenon of interest is the small world; we would like to be able to
explain it with a simple model that generates small world graphs with high probability. What kind of
generative process should we posit? Watts and Strogatz (1998) propose a rewiring model to answer
this question. In their model nodes are embedded in a metric space, $(X, d)$, and each node is
connected to its neighbors according to $d$ by means of undirected edges. Then, with probability $p$
each of the edges that connect a node with its $d$-neighbors is rewired at random, that is, it is
disconnected from a $d$-neighbor and connected to another node in the graph chosen with equal
probability. When the rewiring process has been carried out for each node, the process ends.
Although very simple, the rewiring model of Watts and Strogatz (1998) encodes a key intuition about
how acquaintances may be established, that is, the fact that people form local circles of friends,
and retain a few of them when they move across the country. This social process is suggestive of the
rewiring model. It turns out that the rewiring of local neighbors alone is enough to generate small
world graphs with high probability.
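The rewiring process described in Example 12 can be sketched as follows. The sketch makes choices the text leaves open, all of which are assumptions: nodes sit on a ring, the $d$-neighbors of a node are its `k` nearest nodes on the ring (`k` even), and a rewired edge never creates a self-loop or a duplicate edge. The function names are illustrative.

```python
import random


def ring_lattice(n, k):
    """Adjacency sets for a ring in which every node is linked to its
    k nearest neighbors (k // 2 on each side; k even, k < n)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(1, k // 2 + 1):
            adj[i].add((i + j) % n)
            adj[(i + j) % n].add(i)
    return adj


def rewire(adj, k, p, seed=0):
    """With probability p, detach each lattice edge (i, i + j) from its
    far endpoint and attach it to a uniformly chosen other node."""
    rng = random.Random(seed)
    n = len(adj)
    for j in range(1, k // 2 + 1):
        for i in range(n):
            old = (i + j) % n
            if old in adj[i] and rng.random() < p:
                # exclude self-loops and nodes already linked to i
                candidates = [v for v in range(n)
                              if v != i and v not in adj[i]]
                if candidates:
                    new = rng.choice(candidates)
                    adj[i].discard(old)
                    adj[old].discard(i)
                    adj[i].add(new)
                    adj[new].add(i)
    return adj


if __name__ == "__main__":
    graph = rewire(ring_lattice(20, 4), k=4, p=0.1)
    print(sorted(len(nbrs) for nbrs in graph.values()))
```

Rewiring moves edges without creating or destroying them, so the number of edges stays at n * k / 2 for any value of p; only the degree sequence and the distances between nodes change.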




Our  ability  to  spot  patterns  and  properties  of  interest  crucially  depends  on  the  graph  metrics 
available  to  us.  Many  metrics  have  been  proposed  over  the  years  to  measure  various  properties  of 
graphs  that  could  explain  phenomena  of  interest,  e.g.,  recurrent  connectivity  patterns,  average  path 
length,  or  degree  distribution.  Crucially,  each  of  these  metrics  pre-encodes  some  intuition  about 
the  phenomena  we  conjecture  may  exist.  In  fact,  such  metrics  are  meant  to  capture  numerical 
properties  of  those  graphs  where  the  phenomena  of  interest  occur,  that  are  absent  in  other  graphs. 
In  other  words,  we  can  only  attempt  to  measure  those  patterns  that  we  believe  are  distinct  from 
background  noise.  The  set  of  metrics  available  to  us  is  then  a  byproduct  of  substantive  intuitions 
about  what  we  cannot  see  or  measure.  In  a  technical  sense,  each  metric  encodes  a  structural 
hypothesis,  i.e.,  structural  bias. 

Example 13. Milgram's (1967) empirical analysis and Watts and Strogatz's (1998) theoretical
analysis use different metrics. Milgram considers the average length of first-name chains, and
finds that they are consistently short. In particular, such a short average length (alternatively, such
a small diameter) is less than what we would expect to observe were the acquaintance network a
purely random graph (Erdős and Rényi, 1959; Gilbert, 1959). However, there is possibly an infinite
number of ways to generate graphs with a diameter smaller than that of a random graph. Which
properties of small world graphs unequivocally distinguish them from others? In order to identify
small world graphs, Watts and Strogatz consider the characteristic path length (closely related to
Milgram's metric) and the clustering coefficient.1 Stated formally, they find that the diameter of a
graph drops from $O(n)$ to $O(\log n)$ even when a small fraction $p$ of the edges is rewired,2 whereas
the clustering coefficient remains close to 3/4. The rewiring process has little effect on several
other metrics as well.
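The two metrics in Example 13 can be computed directly on small graphs. The sketch below is an illustration under assumptions rather than the procedure of Watts and Strogatz: the unrewired graph is taken to be a ring lattice with `k` neighbors per node, the clustering coefficient is the average over nodes of the fraction of linked neighbor pairs, and the characteristic path length is the mean shortest-path length over mutually reachable pairs. For this lattice the clustering coefficient equals 3(k - 2) / (4(k - 1)), a standard result not derived in the text, which approaches the value 3/4 quoted above as k grows.

```python
from collections import deque


def ring_lattice(n, k):
    """Ring in which every node is linked to its k nearest neighbors."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(1, k // 2 + 1):
            adj[i].add((i + j) % n)
            adj[(i + j) % n].add(i)
    return adj


def clustering_coefficient(adj):
    """Average over nodes of the fraction of neighbor pairs that are
    themselves linked (nodes with fewer than two neighbors count as 0)."""
    total = 0.0
    for nbrs in adj.values():
        nbrs = list(nbrs)
        d = len(nbrs)
        if d < 2:
            continue
        links = sum(1 for a in range(d) for b in range(a + 1, d)
                    if nbrs[b] in adj[nbrs[a]])
        total += 2.0 * links / (d * (d - 1))
    return total / len(adj)


def characteristic_path_length(adj):
    """Mean shortest-path length over ordered pairs of distinct,
    mutually reachable nodes, using breadth-first search."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


if __name__ == "__main__":
    for k in (4, 10, 40):
        lattice = ring_lattice(200, k)
        print(k, clustering_coefficient(lattice),
              characteristic_path_length(lattice))
```

Applying the same two functions to a rewired copy of the lattice gives a direct way to observe the behavior described in Example 13 on a particular graph.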

1 In Section 2.1 we show to what extent these two metrics can distinguish graphs with a small world
topology from graphs with other topological properties.

2 Although, as noted by Bollobás and Riordan (2003), this fact is a particular instance of a classic
result of random graph theory about the diameter of the giant component (Erdős and Rényi, 1960),
there is much merit in the suggestive power of the simple generative model introduced by Watts and
Strogatz (1998), who place such a result in a context relevant to the scientific community at large.

Brief Overview of Results. During the past few years the attention of the scientific community
has increasingly focused on complex graphs and dynamic networks. As a consequence, many
scientific investigations that involve graphs to some degree attempt an assessment of how sensitive
the main findings are to topological properties of the graphs. The general trend, however, is for
such investigations to leverage popular models of graphs, rather than focusing on their topological
properties.

Example 14. High throughput techniques have made possible the collection of data on many
complementary aspects of the biology of the major species living on our planet, and integral
approaches to the biological sciences are now possible. As a consequence of this, dependence among
observations, data and model integration, and ultimately network science, have become
fundamental to our ability to carry on the process of scientific discovery in this domain. Relevant to
the discussion here is the fact that more and more research articles on complex biological networks,
e.g., protein interaction networks, gene regulatory networks, and metabolic networks, make use
of popular models of graphs such as the scale-free model (Barabási and Albert, 1999), which is
consistent with different graph topologies (Bollobás and Riordan, 2003), rather than investigating
the topological properties of such networks directly.

A crucial issue is then the mapping between popular models of graphs and the topological
properties they possess. In particular, under scrutiny is whether the various graph topologies are
different; if so, by how much; and whether the different generative algorithms for a specific graph
topology lead to the same topological properties. This is in some sense a chicken and egg problem,
since our ability to probe the space of topologies is limited, as discussed above, by the set of
metrics we use. A brief exploration of these issues is presented in Section 2.1. I find that the
popular models, e.g., scale-free and core-periphery, generate graphs with similar topological
properties for non-pathological values of the relevant underlying constants. Furthermore, I find
that alternative models that supposedly generate graphs with indistinguishable topological
properties, e.g., models of scale-free graphs by different authors, can be easily distinguished.
These findings prompt us to make recommendations about how to provide successful assessments of the
sensitivity of an analysis to topological properties of graphs. They also suggest that real-world
graphs may be better modeled as mixtures of these popular models (Airoldi and Carley, 2005).

Another major issue is that the popular algorithms that have been proposed in the literature to
generate topologies of interest can replicate local and global phenomena, but have no place for the
data. That is, given a few underlying constants such algorithms generate observations that display a
certain class of behavior, but it is never specified how to estimate values for those constants so that
the generated behavior best fits a collection of data. Being able to make good use of the data
is crucial whenever models and algorithms have to support analyses of real data. In Section 2.2
alternative mathematical representations of a graph are discussed. Within such context, I show how
it is possible to posit a general class of probabilistic models, which I refer to as exchangeable-edge
models, that generate graphs with pre-specified topological properties of interest, and at the same
time allow for formal inference and estimation procedures. These models inform novel analyses
of lognormal and cellular graphs.

I  consider  lognormal  graphs  in  Section  2.2.2.  I  find  that  scale-free,  or  power-law,  graphs  provide 
us  with  a  first-order  approximation  to  graphs  in  this  class.  I  then  introduce  a  novel  generative  model 
for  lognormal  graphs  that  serves  to  show  how  models  in  this  class  (and  hence  scale-free  or  power- 
law  graphs)  may  arise,  and  find  that  they  do  in  two  interesting  sets  of  circumstances.  First,  I  find 
that  lognormal  graphs  may  arise  as  an  artifact  of  the  network  construction  process.  Namely,  they 
arise  in  situations  where  edges  are  set  between  pairs  of  objects  by  thresholding  correlation  among 
their  attributes  even  when  such  attributes  are  completely  independent.  In  that  sense,  lognormal 
graphs  are  an  artifact  due  to  the  way  we  measure  the  presence  or  absence  of  relations  among  objects 
in  a  population  of  interest.  Second,  I  find  that  lognormal  graphs  may  arise  as  a  consequence  of 
a limiting phenomenon when certain conditions hold on the way edges are established between a
new object and existing objects in the graph; namely, by means of a multiplicative process.

I  consider  cellular  graphs  in  Section  2.2.3.  I  find  that  communities  form  because  of  the  joint 
effect of two simple factors: (i) exclusivity, that is, the need for allocating resources to competing
interests; and (ii) homophily, that is, the fact that social interactions are more likely to occur
between  individuals  who  share  interests  than  between  those  who  do  not.  Furthermore,  I  find  that 
communities  emerge  quickly  as  exclusivity  exceeds  a  certain  threshold. 


2.1  Stability  and  Separability  of  Metric  Embeddings 

The  context  behind  the  exploration  I  present  here  is  given  by  two  observations.  First,  the  popularity 
gained by generative models of graphs has helped establish several flavors of graph topologies in
the  scientific  community  at  large.  For  example,  few  scientists  today  are  unaware  of  notions  such  as 
scale-free  and  small-world  networks.  Because  of  their  popularity,  such  notions  are  often  leveraged 
in  published  research  literature — for  better  or  worse.  Second,  any  scientific  approach  to  modeling 
graphs  faces  the  technical  issue  of  which  minimal  set  of  features  can  be  used  to  characterize  a 
graph.  The  popular  answer  is  to  focus  on  a  set  of  real-valued  metrics  (i.e.,  functions  of  edges  and 
vertices  of  a  graph,  which  are  defined  to  capture  specific  topological  properties)  thus  characterizing 
the  graph  as  a  vector.3 

A crucial issue is then to study the mapping between popular models of graphs and the topological
properties of the set of graphs they support. In particular, under scrutiny is whether the various
graph  topologies  are  different;  if  so,  by  how  much;  and  whether  the  different  generative  algorithms 
for  a  specific  graph  topology  lead  to  the  same  topological  properties. 

3  In this thesis, I am interested in characterizations that entail a one-to-one map with the space of graphs; this issue
is explored in detail in Section 2.2.

Table 2.1: Summary of published generative algorithms.

Type   Algorithm                                      Parameters
1.1.   Ring Lattice                                   $\Theta = (N, K_0)$
2.1.   Small World (Watts and Strogatz, 1998)         $\Theta = (N, K_0, Q_0)$
2.2.   Small World (Kleinberg, 1999a)                 $\Theta = (N, K_0, K_1, R)$
2.3.   Small World (Airoldi, 2005)                    $\Theta = (N, K_0, Q_0, Q_1, R)$
3.1.   Random (Erdos and Renyi, 1959)                 $\Theta = (N, M)$
3.2.   Random (Gilbert, 1959)                         $\Theta = (N, P)$
4.1.   Core-Periphery (Borgatti and Everett, 1999)    $\Theta = (N, P, T_0)$
4.2.   Core-Periphery (Airoldi and Carley, 2005)      $\Theta = (N, P, P_0)$
5.1.   Scale Free (Barabasi and Albert, 1999)         $\Theta = (N, P, N_0, P_0)$
5.2.   Scale Free (Airoldi and Carley, 2005)          $\Theta = (N, M, R)$
6.1.   Cellular (Airoldi and Carley, 2005)            $\Theta = (N, P, B, P_B, R)$

2.1.1  Experimental  Evidence 

Airoldi and Carley (2005) survey various flavors of graph topologies, or topology types, along with
popular published algorithms that have been introduced to generate them; introduce novel algorithms
that support a more diverse set of graphs, in terms of the variability of their topological properties;
and address the case of cellular graph topology. See Figure 2.1 for examples of the various
topologies considered. Here, I briefly report on two statistical studies: (i) on the stability of topological
properties of graphs to alternative algorithms that have been proposed to generate the same
topology type; and (ii) on the separability of graphs with distinct topology types, characterized by
a set of 18 metrics for the analysis of graphs, widely adopted in the social and physical sciences.


Figure 2.1: A glance at the relevant topologies on a ring. Panels: 1. Ring Lattice (RL); 2. Small World (SW);
3. Erdos Random (ER); 4. Core Periphery (CP); 5. Scale Free (SF); 6. Cellular (CEL). Note that in a ring
there is a natural notion of distance that is distinct from the one entailed by shortest paths, i.e., the
distance between nodes A and B is proportional to the arc-length that joins them, along the circle
outlined by the ring.

Stability of Topological Properties to Variations in the Algorithms  The stability study is targeted
towards the three most popular notions of random, small world, and scale free (also known
as power-law) topologies. The overall plan is simple. First, for each of these three topologies, I
will use the proposed algorithms to generate a collection of graphs, and I will label them according
to the specific algorithm variant that was used. Then, I will assess how well it is possible
to distinguish graphs that were generated by different algorithm variants with a batch of cross-validation
experiments. I use a factorial experimental design; ten graphs were generated for each parameter
configuration, and parameter configurations were set by defining a grid on the support of each
parameter and then picking all combinations (Airoldi and Carley, 2005, Table 2). Graphs were
generated on 250 nodes, and the choice of parameter configurations was further informed by
controlling the density of edges. The rationale behind these choices is to make it hard for the
classification algorithms to distinguish graphs based upon the scale and the density of the generated
graphs. As a consequence, it is conceivable that any evidence of discriminatory power may be due
to differential topological properties of the graphs generated by alternative algorithm variants.

I find that

(i) Extremal statistics (i.e., minimum and maximum) are good discriminators between the two
algorithm variants that generate random graphs. The cross-validated accuracy is in the high
nineties, and this comes as no surprise given that Algorithm 3.1 of Table 2.1 leads to graphs with
an exact density, whereas Algorithm 3.2 does not.

(ii) Properties of the degree distribution are good discriminators between algorithm variants that
generate scale free graphs; the cross-validated accuracy is in the mid nineties; this can be
explained by the non-negligible effect of the initial graph that is needed to initialize Algorithm
5.1, and it is consistent with the analysis of Bollobas and Riordan (2003).

(iii) Centrality of nodes and clustering coefficient are fairly good discriminators between algorithm
variants that generate small world graphs; the cross-validated accuracy is in the mid
eighties when we try to distinguish between sample graphs of Algorithms 2.1 and 2.2 or
between Algorithms 2.2 and 2.3, and it drops to the high seventies when we try to distinguish
between Algorithms 2.1 and 2.3. This is expected since Algorithms 2.1 and 2.3 lead
to graphs with more variable neighborhood structures than Algorithm 2.2.
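
The contrast in finding (i) can be illustrated with a minimal sketch. It assumes the textbook forms of the two random graph models: Algorithm 3.1 places exactly M edges, while Algorithm 3.2 includes each edge independently with probability P. The function names, the graph size and the number of replicates are hypothetical choices made for illustration; they are not the settings of the experiments reported here.

```python
import itertools
import random


def erdos_renyi_gnm(n_nodes, n_edges, rng):
    """G(N, M): choose exactly n_edges distinct undirected edges uniformly at random."""
    all_pairs = list(itertools.combinations(range(n_nodes), 2))
    return set(rng.sample(all_pairs, n_edges))


def gilbert_gnp(n_nodes, p_edge, rng):
    """G(N, P): include each undirected edge independently with probability p_edge."""
    return {
        pair
        for pair in itertools.combinations(range(n_nodes), 2)
        if rng.random() < p_edge
    }


def density(edges, n_nodes):
    """Fraction of the N(N-1)/2 possible undirected edges that are present."""
    return len(edges) / (n_nodes * (n_nodes - 1) / 2)


rng = random.Random(0)
n_nodes, n_edges = 50, 245  # illustrative values: 245 of 1225 possible edges
p_edge = n_edges / (n_nodes * (n_nodes - 1) / 2)  # matched expected density

gnm_densities = {density(erdos_renyi_gnm(n_nodes, n_edges, rng), n_nodes) for _ in range(20)}
gnp_densities = {density(gilbert_gnp(n_nodes, p_edge, rng), n_nodes) for _ in range(20)}

print(len(gnm_densities))  # one distinct value: the density is fixed by construction
print(len(gnp_densities))  # typically many distinct values: the density is random
```

Even with matched expected density, the spread of the realized density (and hence of extremal degree statistics) differs between the two variants, which is the kind of difference a classifier can exploit.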

The published generative algorithms described in Table 2.1 play a role in this small experiment
for which they were not designed. They were originally proposed to illustrate mechanisms of
aggregation suggestive of social and artificial regularities underlying observed phenomena. Because
of their popularity, however, the scientific community is adopting these mechanisms for analyses
that are very different in scope from those they were intended for. I find this practice dangerous,
both in terms of the reproducibility of the analyses, and in terms of the support such simplistic
algorithms can offer to substantive conclusions. This is the message I mean to offer with the
experiments presented here.

Table 2.2: Pairwise comparisons; entries quote the errors achieved in discriminating graphs generated
according to pairs of algorithms. Errors are estimated using cross-validation. The size is fixed
at 250 nodes; the density is controlled by design.

              Lattice   Random   Small World   Scale Free   Cellular   Core-Per.
Lattice       N/A       0.2700   0.0745        0.00         0.00       0.00
Random                  0.00     0.4122        0.2794       0.3255     0.25
Small World                      0.2478        0.0866       0.1312     0.0531
Scale Free                                     0.0007       0.2645     0.3333
Cellular                                                    0.1746     0.3715
Core-Per.                                                              0.50

Popular Notions of Graph Topologies and their Topological Properties  The separability
study is motivated by another question related to data analysis. Oftentimes, funding agencies
and scholars4 find it interesting to investigate the following: given a collection of measurements
on pairs of objects in a population of interest, what is the popular notion of topology that best
represents such observations? The question seems to imply, or assume, that topology types are
uniquely defined in terms of a few topological properties, namely those upon which the corresponding
models are based. The goal of the separability study is to assess the extent to which this is an ill-posed
question.

I performed two batches of experiments, whose results are reported in Tables 2.2 and 2.3. The
first batch of experiments considered about 6000 graphs, generated according to the factorial
experimental design used in the stability study, but extended to all the algorithms in Table 2.1. The size of
all graphs was set at 250 nodes, and the density was controlled by design. The results are presented
in Table 2.2, in the form of pairwise comparisons. Each entry gives the error achieved in
discriminating graphs generated according to the corresponding pair of published algorithms; diagonal
entries report the errors corresponding to the stability study discussed above. Errors were obtained
with a naive Bayes classifier; similar results were obtained with decision trees, support vector
machines, and logistic regression. The second batch of experiments considered about 40000 graphs,
sampled from a pool of one million graphs generated according to the full factorial block-design of
Frantz and Carley (2005b), which makes use of all the algorithms in Table 2.1. In the sample, the
size of the graphs is controlled for, and so is the density. The results5 are presented in Table 2.3.
Each entry provides the discrimination errors achieved with a decision tree; similar results were
obtained using a naive Bayes classifier, support vector machines, and logistic regression.

4 I omit citations here, although it is easy to identify notable examples.

Table 2.3: Pairwise comparisons and best three discriminators; entries quote the errors achieved
in discriminating graphs generated according to published algorithms. Errors are estimated using
cross-validation, with the quoted topological measures as the unique features. The size and density
are variable, not controlled by design.

Comparison                   1st Property                  2nd Property                     3rd Property
Random vs. Small World       Net. Constr. (max)  0.2740    Connectedness  0.2960            Net. Constr. (dev)  0.3800
Random vs. Scale Free        Eig. Cnt. (min)  0.3690       Close. Cnt. (min)  0.4040        Inv-Close. Cnt. (min)  0.4100
Random vs. Cellular          Eig. Cnt. (avg)  0.3110       Inv-Close. Cnt. (min)  0.3140    Close. Cnt. (min)  0.3170
Random vs. Core-Per.         Centrality (dev)  0.3470      Centrality (max)  0.3500         Close. Cnt. (dev)  0.3530
Small World vs. Scale Free   Net. Constr. (max)  0.2590    Eig. Cnt. (min)  0.2720          Eig. Cnt. (avg)  0.3410
Small World vs. Cellular     Net. Constr. (max)  0.1380    Connectedness  0.1750            Eig. Cnt. (avg)  0.2200
Small World vs. Core-Per.    Net. Constr. (max)  0.2400    Connectedness  0.2620            Centrality (max)  0.2870
Scale Free vs. Cellular      Centrality (max)  0.2860      Close. Cnt. (max)  0.2860        Inv-Close. Cnt. (max)  0.3060
Scale Free vs. Core-Per.     Centrality (dev)  0.3480      Close. Cnt. (dev)  0.3530        Connectedness  0.4170
Cellular vs. Core-Per.       Centrality (max)  0.2250      Inv-Close. Cnt. (max)  0.2520    Close. Cnt. (max)  0.2580

The analysis of the results of both separability studies suggests that (i) the generative algorithms
presented in Table 2.1 often lead to unrealistic variability profiles for specific metrics over
a fairly large range of parameter values, either by design or by construction; (ii) as we consider
the collections of graphs generated according to popular notions of topology types in a large space of
topological properties, the boundary between pairs of topology types is not sharp, and most of the
graphs display mixed characteristics.

2.1.2  Discussion 

The experimental evidence suggests that scientific questions about the data that rely on popular
notions of graph topologies have to be treated carefully. Topology types are operationally defined
by specific data generating processes that were devised to illustrate the effects of compelling
aggregation mechanisms on a small set of topological properties of a graph. The proposed
statistical analysis of such algorithms shows that (i) alternative options available for generating the
same topology type are distinguishable in terms of the set of topological properties they entail, and
(ii) processes that supposedly lead to distinct topology types actually generate graphs that share
many topological properties. Therefore, while these algorithms deliver insights about phenomena
of interest, it is dangerous to employ them for other purposes, as is often done in practice. In the
context of statistical testing, for example, those algorithms may be used to produce p-values for
metrics of interest. Topological properties of a graph under the null hypothesis can be evaluated
by sampling graphs according to one of the popular algorithms listed in Table 2.1. In conclusion, the
experimental evidence suggests that (i) we need a larger set of topological properties to be able to
characterize a graph exactly, e.g., along the lines of a representation theorem, and (ii) we need a
richer set of statistical models for the purpose of data analysis.

5 I wish to thank Ian C. Fette for facilitating this study by sharing useful code.

An  alternative  approach  to  the  scientific  analysis  of  complex  and  dynamic  graphs  keeps  the 
topological properties of the data (an observed collection of paired measurements) at the forefront
of the analysis. Along these lines, statistical models of graphs with desirable topological
properties,  and  their  relation  to  the  popular  notions  of  topology,  are  explored  in  Section  2.2. 


2.2  Exchangeable-Edge  Models 

A major issue with many of the popular algorithms that have been proposed in the literature to
generate topologies of interest is that, while they are meant to replicate local and global phenomena,
a procedure to estimate values for the constants that correspond to those phenomena, as well as a
procedure to fit corresponding models to available data and assess their fit, are seldom specified.
Being able to make good use of the data is crucial whenever models and algorithms have to
support substantive analyses and conclusions.

Example 15. The US Army believes that the efficiency of communications during combat is directly
correlated with the outcome of a battle. The efficiency in this context is defined in terms of
a set of relevant network metrics, and being able to monitor these metrics is the task of interest. In
particular, it is crucial to be able to detect whenever communication patterns start displaying a
level of variability that is considered abnormal, non-optimal, and ultimately dangerous. This problem
can be stated formally as a statistical change-point problem, where detection has to occur in
real-time, on a stream of data about communications in the network. In this context, a probabilistic
model of a communication network, $P(G_t \mid \Theta)$, corresponds to a statistical model for a random
variable, $P(X_t \mid \Theta)$, in the classical formulation. Statistical procedures that lead to estimates of the
underlying constants, $\Theta$, with desirable properties, e.g., consistency and unbiasedness, are critical
for detecting deviations from normality, in both formulations.

Here,  I  discuss  mathematical  characterizations  of  a  graph.  Within  this  context,  I  show  how  it 
is  possible  to  posit  a  general  class  of  probabilistic  models,  which  I  refer  to  as  exchangeable-edge 
models,  that  generate  graphs  with  pre-specified  topological  properties  of  interest,  and  at  the  same 
time  allow  for  formal  inference  and  estimation  procedures.  I  demonstrate  the  utility  and  flexibility 
of  these  models  by  introducing  a  novel  analytical  perspective  of  lognormal  and  cellular  graphs. 

Challenges to the Mathematical Characterization of Graphs  The minimal representation of
a graph $G$ is given in terms of a set of vertices, $\mathcal{N}$, and a set of edges, $\mathcal{E}$, encoded by an adjacency
matrix, $Y$, where the entry $Y(n,m) \in \{0,1\}$ encodes the presence or absence of the corresponding
directed edge, $n \to m$. I make no distinction between edges and edge weights in the presentation
below.

The matrix $Y$ characterizes the topological properties of the graph it represents. For example,
the degree of a binary graph (this simplifies things by requiring no distinction between in- and
out-degree) is defined as a vector $d_G$ with generic element

$$ d_G(n) = \sum_{m} Y(n,m) \quad \text{for } n \in \mathcal{N}. $$

In general, the collection of eigenvalues, $\lambda_{1:N}$, and eigenvectors, $u_{1:N}$, of $Y$ give us an exact
characterization of the graph as follows,

$$ Y = \sum_{n=1}^{N} \lambda_n \, u_n u_n', $$

where $N$ is the number of nodes in $\mathcal{N}$. In this sense, exact characterizations of a matrix provide us
with exact characterizations of a graph.

The following question is of interest: what set of topological properties is necessary and sufficient
to characterize the matrix Y exactly? Here is the challenge: from a statistical modeling
perspective we would like to characterize a graph in terms of its essential topological properties.
This would allow us to inform models of complex and dynamic graphs by analyzing such essential
properties measured on observed graphs. On the other hand, the results of the previous section
suggest that there is a disconnect between the popular classes of graphs for which generative models
have been published and characteristics of their topological properties, as captured by existing
metrics. It is not known whether a set of topological properties exists that exactly characterizes a
graph, or, alternatively, what classes of graphs can be defined in terms of sets of constraints on a
set of topological properties. These questions provide the context for the investigations that follow.
I seek either exact or approximate characterizations, either for graphs with a finite set of nodes or
in the infinite limit of large graphs.
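
As an illustrative sketch, not drawn from the experiments reported here, the following example shows why a single topological property can fail to characterize the adjacency matrix. It builds two undirected graphs on six nodes, a hexagon and a pair of disjoint triangles, that share the same degree vector yet differ in their number of triangles. The helper names are hypothetical.

```python
def empty_adjacency(n_nodes):
    """An n_nodes x n_nodes adjacency matrix with no edges."""
    return [[0] * n_nodes for _ in range(n_nodes)]


def add_cycle(Y, nodes):
    """Connect consecutive nodes (and the last to the first) with undirected edges."""
    for i, n in enumerate(nodes):
        m = nodes[(i + 1) % len(nodes)]
        Y[n][m] = Y[m][n] = 1


def degree(Y):
    """Degree vector: d(n) is the sum over m of Y(n, m)."""
    return [sum(row) for row in Y]


def triangle_count(Y):
    """Number of triangles in an undirected graph, computed as trace(Y^3) / 6."""
    n = len(Y)
    closed_walks = sum(
        Y[a][b] * Y[b][c] * Y[c][a]
        for a in range(n) for b in range(n) for c in range(n)
    )
    return closed_walks // 6


hexagon = empty_adjacency(6)
add_cycle(hexagon, [0, 1, 2, 3, 4, 5])

two_triangles = empty_adjacency(6)
add_cycle(two_triangles, [0, 1, 2])
add_cycle(two_triangles, [3, 4, 5])

print(degree(hexagon) == degree(two_triangles))  # True: every node has degree 2
print(triangle_count(hexagon), triangle_count(two_triangles))  # 0 2
```

The degree vector alone is therefore not sufficient; any candidate set of properties must at least separate pairs of graphs such as these.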


2.2.1  Specifications  and  Likelihood 

A  first  step  is  to  extend  the  random  graph  models  of  Erdos  and  Renyi  (1959)  and  Gilbert  (1959) 
to  include  a  set  of  latent  variables.  This  will  make  the  edges  exchangeable,  i.e.,  conditionally 
independent  given  values  of  these  latent  variables,  rather  than  independent.  The  latent  variables 
are  themselves  an  IID  sample  from  a  common  distribution.  This  extension  allows  me  to  reproduce 
the  behavior  of  the  original  random  graph  models,  and  to  induce  a  new  array  of  interesting  global 
behaviors  such  as  those  encoded  in  lognormal  graphs  (Section  2.2.2)  and  cellular  graphs  (Section 
2.2.3).  Furthermore,  it  is  possible  to  write  down  and  evaluate  the  likelihood  corresponding  to  such 
models. 

It is possible to generate a diverse set of graphs by means of exchangeable-edge models, a fairly
general class of statistical models of random graphs. An exchangeable-edge model of a graph is
specified as follows,

$$ Y(n,m) \mid \Theta \sim P_\Theta \tag{2.1} $$

$$ \Theta \mid A \sim P_A, \tag{2.2} $$

for each element $(n,m)$ of the matrix $Y$; where $n, m$ are nodes in $\mathcal{N}$; $\Theta$ is a collection of latent
variables; $A$ is a collection of hyper-parameters; and $P_\Theta, P_A$ are probability distributions. The
likelihood can be written as follows,

$$ \mathcal{L}(Y \mid A) = \int P_A(\Theta \mid A) \prod_{n,m} P_\Theta\big(Y(n,m) \mid \Theta\big) \, d\Theta. \tag{2.3} $$

In a more general formulation Y may be multivariate, e.g., may encode multivariate sociometric
relations (Sampson, 1968), longitudinal, e.g., may encode a temporal sequence of graphs
(Priebe et al., 2005), or even more complex.
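
To make the likelihood in Equation 2.3 concrete, here is a minimal sketch under strong simplifying assumptions that are mine and not part of the general specification: the latent collection reduces to a single scalar edge probability, its distribution is Uniform(0, 1), and every edge is Bernoulli given that scalar. In this special case the integral has a closed form, which the sketch compares with a plain Monte Carlo average over draws of the latent variable. All function names are hypothetical.

```python
import math
import random


def exact_marginal_likelihood(n_present, n_pairs):
    """Closed form of the integral of theta^s (1 - theta)^(E - s) over [0, 1]."""
    s, e = n_present, n_pairs
    return math.factorial(s) * math.factorial(e - s) / math.factorial(e + 1)


def monte_carlo_marginal_likelihood(n_present, n_pairs, n_draws, rng):
    """Average the conditional likelihood over draws of theta from Uniform(0, 1)."""
    total = 0.0
    for _ in range(n_draws):
        theta = rng.random()
        total += theta ** n_present * (1.0 - theta) ** (n_pairs - n_present)
    return total / n_draws


# Toy directed graph on 3 nodes: 6 ordered pairs, 2 of which carry an edge.
Y = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
pairs = [(n, m) for n in range(3) for m in range(3) if n != m]
n_present = sum(Y[n][m] for n, m in pairs)

exact = exact_marginal_likelihood(n_present, len(pairs))
approx = monte_carlo_marginal_likelihood(n_present, len(pairs), 200_000, random.Random(1))
print(exact, approx)
```

Marginally, the edges in this toy model are exchangeable but not independent, since they share the same latent edge probability. For richer latent structures the integral rarely has a closed form, and some approximation of this kind becomes necessary.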


Example 16. In many applications, it is convenient to distinguish between latent variables that are
partially, or potentially, measurable and correspond to substantive concepts, e.g., tight groups of
agents in the analysis of social networks, and other latent variables. Following the discussion in
Section 1.2, we may denote by $\Theta$ the partially observable variables with a substantive interpretation,
e.g., club membership, and by $\Xi$ other latent variables, e.g., propensity to participate in social
activities. Latent variables in both these collections are agent-specific; namely, $\Theta = \theta_{1:N}$ is a
collection of agent-specific vectors whose components specify the grade of membership of agents to
clubs, whereas $\Xi = \xi_{1:N}$ is a collection of agent-specific scalars that specify agents’ propensities to
socialize with members of other clubs. The corresponding exchangeable-edge model for a sequence of
graphs $G_{1:T} = (Y_{1:T}, \mathcal{N})$, that encodes social interactions recorded over $T$ weeks among the same set
of agents, $\mathcal{N}$, takes the form of a hierarchical mixed-membership model. At time $t$, it posits that

$$ Y_t(n,m) \mid (\xi_{nm}, \theta_{nm}) \sim P_{(\xi,\theta)} \tag{2.4} $$


(2.5) 


(2.6) 


for each observed social interaction $(n,m)$ recorded in the matrix $Y_t$ during the $t$-th week; where
$n, m$ are agents in $\mathcal{N}$; $\xi_{nm}$ denotes $(\xi_n, \xi_m)$; $\theta_{nm}$ denotes $(\theta_n, \theta_m)$; and $P_{(\xi,\theta)}$, $P_a$, $P_b$ are
probability distributions. Figure 2.2 illustrates the graphical representation of this model, using plates.
Note that, in this model, both the degree of membership of agents to clubs, and the propensity
to socialize with other agents do not change from one week to the next. Furthermore, the grade
of membership and the propensity to socialize are non-observable, competing explanations of the
observed interactions.


The main advantage of these models is that edges are exchangeable, that is, weakly dependent6
rather than independent. Working within the skeletal specifications of Equations 2.1 and 2.2, we
can introduce layers in the hierarchy and posit stochastic block models (Airoldi et al., 2006d, and
Section 3.1), latent space models (Hoff et al., 2002), and diffusion models (Coifman et al., 2005a,b,
and Section 3.2). In Chapter 5, I discuss how to introduce flavors of dynamics and evolution in a
few cases; temporal models will break the exchangeability of edges within a graph and introduce
different dependence structures.


6 De Finetti’s theorem implies exchangeable edges can be characterized as being independent conditionally on a
collection of latent variables (Schervish, 1995).

Figure 2.2: Graphical representation of the exchangeable edge model in Example 16.

2.2.2  Lognormal  Graphs 

Researchers  in  a  diverse  set  of  disciplines  have  offered  evidence  and  arguments  that  ultimately 
purport  the  power-law  distribution  as  an  inevitable  regularity  of  natural  and  artificial  phenomena 
alike.  In  this  section,  I  investigate  the  basis  for  such  claims. 

Main  Argument — Part  I:  Building  Association  Graphs  Consider  the  following  exchangeable- 
edge generating process for a (binary, undirected) graph $G = (Y, \mathcal{N})$:

Algorithm A1

1.  for $n \in \mathcal{N}$
2.      for $k = 1, \dots, K$
3.          $b_n(k) \sim \text{Bernoulli}(p)$
4.  for $n, m \in \mathcal{N}$
5.      $Y(n,m) \sim \text{Dirac}\big(f(b_n, b_m) > \tau\big)$

The constants underlying Algorithm A1 are $\Theta = (K, p, f, \tau)$, where $K$ is the number of node-specific
attributes, $b_n(k)$; $p$ is the probability that any one of the attributes is present; $f(\cdot,\cdot)$ is
a measure of similarity; and $\tau$ is the $f$-similarity threshold beyond which two nodes are set as
connected. Algorithm A1 replicates a common network construction process.7 In such a scenario,
the node-specific collections of $b_n$ represent noisy measurements about those aspects of a node that
matter when decisions about the presence of a tie with other nodes have to be made.

Below I explore the degree distribution of an association graph that arises in a case where there
is no association structure among nodes, in terms of their multivariate attribute representations,
$b_{1:N}$. In order to derive the degree distribution I completely specified Algorithm A1 by setting the
size of the graph to 100 nodes; I set the number of latent aspects, $K$, to ten, and the probability that
any one of the attributes is present, $p$, to 0.5; in different experiments I explored various measures
of association based on Pearson’s correlation coefficient, a few of many possible choices for $f$; last,
in order to find (purportedly) significant associations, I set the $f$-threshold, $\tau$, to be the 95-th
percentile of the observed associations (about 5000 for each graph). I then sampled many graphs.
As an example, the limiting average degree distribution of one of the graphs is shown in Figure
2.3, on a log-log scale; the corresponding matrix, $Y$, is shown in Figure 2.4. The results of the
simulations consistently suggest a quadratic relation between nodes’ degree and their frequency, on
the log-log scale. That is, the simulated association graphs have a lognormal degree distribution,
whenever no association exists among their attribute representations, $b_{1:N}$.
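
The construction just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the reported simulations: the similarity is plain Pearson correlation (set to zero when an attribute vector is constant, a convention chosen here), the threshold is taken as an empirical quantile of the observed similarities, and all function names are hypothetical.

```python
import itertools
import random
from collections import Counter


def pearson(x, y):
    """Pearson's correlation; returns 0.0 when either vector is constant (undefined case)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


def sample_association_graph(n_nodes, n_attributes, p, quantile, rng):
    """Steps 1-5 of Algorithm A1, with f = Pearson's r and the threshold set to a quantile."""
    # Steps 1-3: independent Bernoulli(p) attributes for every node.
    b = [[1 if rng.random() < p else 0 for _ in range(n_attributes)]
         for _ in range(n_nodes)]
    # Similarity f(b_n, b_m) for every unordered pair of nodes.
    pairs = list(itertools.combinations(range(n_nodes), 2))
    similarity = {pair: pearson(b[pair[0]], b[pair[1]]) for pair in pairs}
    # Threshold: the requested empirical quantile of the observed similarities.
    ranked = sorted(similarity.values())
    tau = ranked[int(quantile * (len(ranked) - 1))]
    # Steps 4-5: an edge is present exactly when the similarity exceeds the threshold.
    edges = [pair for pair in pairs if similarity[pair] > tau]
    return edges, tau


def degree_counts(edges, n_nodes):
    """Map each observed degree to the number of nodes that have it."""
    degree = [0] * n_nodes
    for n, m in edges:
        degree[n] += 1
        degree[m] += 1
    return Counter(degree)


rng = random.Random(0)
edges, tau = sample_association_graph(
    n_nodes=100, n_attributes=10, p=0.5, quantile=0.95, rng=rng
)
for d, count in sorted(degree_counts(edges, 100).items()):
    print(d, count)
```

Averaging the printed degree frequencies over many sampled graphs, and plotting them on a log-log scale, would be one way to examine the shape of the degree distribution under this sketch.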

7 An alternative formulation of Algorithm A1 posits distinct attribute-specific probabilities for each node, $p = p_n(k)$,
which are sampled independently from a standard Gaussian distribution and then projected into the [0,1]
interval. The attributes, $b_n(k)$, are then independent samples from a Bernoulli, conditionally on such probabilities.

Figure 2.3: Observed degrees versus frequency of nodes with such a degree, over a set of 100
graphs sampled according to Algorithm A1. Squares correspond to averages.

Alternative measurement schemes exist, which are based upon alternative measures of association
that may be adopted to decide when to establish an edge. However, the empirical results
are robust to such alternatives. This empirical observation may be formalized by looking at the
distribution of the association measure $f$ as follows.

Conjecture 1. Any strategy that builds a graph by thresholding a measure of association, $X$, will
induce a degree distribution proportional to the tail of the probability density of $P_X$, where the tail
is defined by the threshold $\tau$.


Figure 2.4: The matrix $Y_{(100 \times 100)}$ of a lognormal graph sampled via Algorithm A1.

The general mechanism through which probability statements about pairs of nodes, e.g., in
terms of the correlations $\Pr(r(n,m) > \tau)$, translate into probability statements about the individual
nodes, e.g., in terms of the degree $\Pr(d(n) < k)$, is fairly simple. Consider the degree of a
node, $n$, in a graph generated via Algorithm A1,

$$ d(n) = \#\{\text{nodes } m \text{ such that } r(n,m) > \tau\}. $$

At a first degree of approximation, $d(n)$ follows a Binomial distribution with parameters $N = |\mathcal{N}|$,
i.e., the number of nodes in the graph, and $p = \Pr(r(n,m) > \tau)$, i.e., the probability of imputing
an edge. The heavy tail of the degree distribution follows from the fact that the probabilities of
imputing edges, $p_{1:N}$, are different for the various nodes.
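
The effect of node-specific edge probabilities on the spread of the degree distribution can be checked with a short calculation. The sketch below takes the Binomial approximation at face value and applies the law of total variance to a node drawn uniformly at random; the probability values are hypothetical, chosen only so that both settings share the same average.

```python
def degree_moments(edge_probabilities, n_trials):
    """Mean and variance of the degree of a node drawn uniformly at random,
    when node n has degree Binomial(n_trials, p_n)."""
    count = len(edge_probabilities)
    means = [n_trials * p for p in edge_probabilities]
    variances = [n_trials * p * (1 - p) for p in edge_probabilities]
    overall_mean = sum(means) / count
    # Law of total variance: average within-node variance plus variance of node means.
    within = sum(variances) / count
    between = sum((m - overall_mean) ** 2 for m in means) / count
    return overall_mean, within + between


n_trials = 100  # number of nodes, following the Binomial approximation in the text
homogeneous = [0.05, 0.05, 0.05, 0.05]
heterogeneous = [0.01, 0.02, 0.05, 0.12]  # hypothetical values, same average of 0.05

mean_hom, var_hom = degree_moments(homogeneous, n_trials)
mean_het, var_het = degree_moments(heterogeneous, n_trials)
print(mean_hom, var_hom)
print(mean_het, var_het)
```

With the same average degree, the heterogeneous setting has a markedly larger variance, because the between-node term is added to the Binomial term. This is consistent with the heavier tail described above, although a variance comparison alone does not identify the form of the tail.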

As we restrict the focus to those thresholding schemes that are based upon Pearson’s correlation coefficient, $r$, we can derive more precise results. Table 2.4 describes four measurement schemes, based on different functions of $r$, in terms of the probability density function they induce on the measure of association, $f(r)$, and of the range of possible values for $f$, i.e., the support. Figure 2.5 shows the estimated probability density functions corresponding to these four association measures. Define $r_N$ to be Pearson’s correlation coefficient, computed on a sample of size $N$ of paired measurements $(X_1, X_2)$. The asymptotic distribution ($N \to \infty$) of $r_N$ is bell-shaped, left-skewed. Fisher (1915) derived the distribution of Pearson’s correlation coefficient when the data are bivariate Gaussian. In subsequent work he showed that the transformation,

\[ \zeta = \frac{1}{2} \log \frac{1 + r}{1 - r}, \qquad r \in [-1, +1], \tag{2.7} \]

is useful in stabilizing the skewness of $r_N$, and induces an approximately Gaussian distribution on $\zeta$ that is approximately unbiased for $\zeta(\rho_{X_1 X_2})$, with standard deviation $(N - 3)^{-1/2}$ (Fisher, 1921).
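As a small illustration of the transformation in Equation 2.7, the sketch below computes Fisher's transformation and its approximate standard deviation. The function names are hypothetical; the formulas are the standard ones.

```python
import math

def fisher_z(r):
    """Fisher's transformation of a correlation coefficient r in (-1, 1)."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))   # identical to math.atanh(r)

def fisher_z_std(n):
    """Approximate standard deviation of the transformed coefficient, sample size n > 3."""
    return 1.0 / math.sqrt(n - 3)
```

Exponentiating the transformed coefficient gives the fourth measure of Table 2.4, whose distribution is lognormal whenever the transformed coefficient is Gaussian.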

Concluding, experimental evidence suggests that graphs with a heavy-tailed degree distribution may be generated as artifacts of the way we measure association, e.g., by thresholding, even in those situations where no real association exists. Observing a heavy-tailed degree distribution in a graph should not be regarded as an interesting substantive finding in the absence of a deeper analysis.


Main  Argument — Part  II:  Limiting  Graph  Structures  Consider  the  behavior  of  graphs  as  new 
nodes  and  edges  appear.  In  the  limit  of  large  graphs  (i.e.,  many  nodes),  and  for  a  wide  range  of 
aggregation  regimes  (i.e.,  different  edge  addition  rules),  lognormal  graphs  are  an  attractor.  Their 
basin  of  attraction  is  large.  Below,  the  limiting  behavior  of  scale-free  graphs  (Barabasi  et  al.,  1999; 


Table 2.4: Characteristics of four measurement schemes to establish associations, by thresholding, based upon Pearson’s correlation coefficient.

Measure   Support      PDF                         Notes
r         [-1, +1]     bell-shaped, left-skewed    See Fisher (1915)
ζ         (-∞, +∞)     Gaussian                    See Anderson (1996)
r²        [0, +1]      bell-shaped, right-skewed
e^ζ       [0, +∞)      lognormal


Figure 2.5: Estimated probability density functions corresponding to the four association measures based upon Pearson’s correlation coefficient described in Table 2.4 (panel titles include Pearson’s Correlation Coefficient, Pearson’s Correlation Coefficient (squared), and Fisher’s ζ Transformation). The vertical (red) bar indicates the 95-th percentile; e.g., $r_0$ s.t. $P(R < r_0) > 0.95$.


Faloutsos  et  al.,  1999;  Huberman  and  Adamic,  1999)  is  revisited  in  the  light  of  this  remark,  in  the 
larger  context  of  statistical  convergence.  Consider  the  following  algorithm. 


Algorithm A2

1. start with $G = (Y = \emptyset, \mathcal{N} = \emptyset)$
2. repeat
3.    add node $n$ to $\mathcal{N}$
4.    sample its degree $d(n) \sim P(\theta)$
5.    connect node $n$ to $d(n)$ existing nodes in $\mathcal{N}$ with equal probability


To illustrate the main point, it is enough to note that Algorithm A2 entails an expected rate of change of the degree of a node, $d_{t+1}(n) / d_t(n)$, constant over time. In this case,

\begin{align}
k_{t+1}(n) &= k_t(n) \cdot d_t(n) \tag{2.8} \\
\log k_{t+1}(n) &= \log k_0(n) + \sum_{s=0}^{t} \log d_s(n) \tag{2.9} \\
k_{t+1}(n) - k_t(n) &= k_t(n) \cdot (d_t(n) - 1) \tag{2.10} \\
\Delta k_{t+1}(n) &= k_t(n) \cdot \delta_t(n), \tag{2.11}
\end{align}

with $\Delta k_{t+1}(n) = k_{t+1}(n) - k_t(n)$ and $\delta_t(n) = d_t(n) - 1$.
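As a concrete companion to Algorithm A2, here is a minimal sketch of its five steps. Two choices are assumptions of the sketch, not of the text: the degree distribution $P(\theta)$ is taken to be Poisson with mean `mean_degree`, and the sampled degree is capped at the number of nodes that already exist. The function name `sample_a2_graph` is hypothetical.

```python
import numpy as np

def sample_a2_graph(n_nodes, mean_degree, seed=0):
    """Grow an undirected graph one node at a time, attaching uniformly at random."""
    rng = np.random.default_rng(seed)
    adjacency = np.zeros((n_nodes, n_nodes), dtype=int)    # step 1: empty graph
    for n in range(n_nodes):                               # steps 2-3: add node n
        # Step 4: sample the degree; Poisson is an assumed stand-in for P(theta).
        d = min(int(rng.poisson(mean_degree)), n)          # at most n nodes exist so far
        # Step 5: pick d existing nodes with equal probability, without replacement.
        targets = rng.choice(n, size=d, replace=False) if d > 0 else []
        for m in targets:
            adjacency[n, m] = adjacency[m, n] = 1
    return adjacency

Y = sample_a2_graph(50, 3.0, seed=1)
```

Row sums of `Y` give the realized degrees, which can be compared across runs with different `mean_degree` values.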

In the algorithm proposed by Barabasi et al. (1999) the rate of change of the degree is the quantity that remains constant over time, so that

\[ E_\theta[\delta_t(n)] \propto \frac{1}{m_0 t + k_0}. \]

It is possible to show the connection in a different way, by looking at the limiting degree distribution of graphs where the expected rate of change of the degree is decreasing over time,

\[ E_\theta[\delta_t(n)] = O(t^{-\alpha}), \qquad \alpha > 0. \]


Remark 1. Power-law graphs (also referred to as scale-free) provide a first-order approximation
to  lognormal  graphs,  in  terms  of  their  degree  distribution. 


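Consider the density of the degree distribution of a lognormal graph,

\[ p(x \mid \mu, \sigma^2) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\left\{ -\frac{(\log x - \mu)^2}{2\sigma^2} \right\}, \qquad \mu \in \mathbb{R}, \text{ and } x, \sigma \in \mathbb{R}_+. \tag{2.12} \]

This entails a quadratic relation between log density and log degree,

\begin{align*}
\log p(x \mid \mu, \sigma) &= -\log(x) - \log(\sigma\sqrt{2\pi}) - \frac{(\log(x) - \mu)^2}{2\sigma^2} \\
&= -\log(\sigma\sqrt{2\pi}) - \frac{\mu^2}{2\sigma^2} - \left(1 - \frac{\mu}{\sigma^2}\right) \log(x) - \frac{1}{2\sigma^2} \log^2(x).
\end{align*}

Taylor-expand $\log p$, with respect to $\log x$, to the first order around $\log x_0$,

\[ \log p(x \mid \mu, \sigma) \approx \log p(x_0 \mid \mu, \sigma) - \left(1 + \frac{\log(x_0) - \mu}{\sigma^2}\right) \left( \log(x) - \log(x_0) \right). \]

If we choose $\log x_0 = \mu$, the mean of $\log x$, the expression above simplifies to

\[ \log p(x \mid \mu, \sigma) \approx -\log(x) - \log(\sigma\sqrt{2\pi}), \]

which implies a linear relation between log density and log degree, that is, the relation underlying the degree distribution of a power-law graph.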
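The first-order argument can be checked numerically. The sketch below evaluates the lognormal log density and estimates, by a central difference, its slope with respect to log degree; expanding around $\log x_0 = \mu$ should give a slope of $-1$. The function names and the step size `h` are assumptions of this sketch.

```python
import math

def lognormal_log_density(x, mu, sigma):
    """Log of the lognormal density in Equation 2.12."""
    u = math.log(x)
    return -u - math.log(sigma * math.sqrt(2 * math.pi)) - (u - mu) ** 2 / (2 * sigma ** 2)

def log_log_slope(x0, mu, sigma, h=1e-5):
    """Central-difference slope of log density with respect to log degree at x0."""
    u0 = math.log(x0)
    up = lognormal_log_density(math.exp(u0 + h), mu, sigma)
    down = lognormal_log_density(math.exp(u0 - h), mu, sigma)
    return (up - down) / (2 * h)

# Analytically the slope is -1 - (log(x0) - mu) / sigma**2.
slope_at_mu = log_log_slope(math.exp(2.0), mu=2.0, sigma=1.5)
```

Away from $\log x_0 = \mu$ the slope changes linearly in $\log x_0$, which is the curvature that a power-law fit ignores.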

Thus we see that exchangeable-edge models illustrate how lognormal graphs arise in two interesting sets of circumstances: (i) as an artifact of the way we measure associations; and (ii) in the limit, as a consequence of multiplicative aggregation processes. Inasmuch as they approximate lognormal graphs, power-law graphs may arise in the same circumstances. This fact may help to explain their ubiquity.


2.2.3  Cellular  Graphs 

Despite the fact that they are so pervasive, no characterization exists that explains why cellular networks arise as a structure of collective organization. What are the conditions that naturally conjure cellular structures among individuals? Below, I introduce a simple exchangeable-edge model that suggests a possible answer.

I  posit  a  stylized  model  of  a  population  of  agents  with  limited  resources.  In  particular,  I  assume 
that  one  of  the  resources  (e.g.,  time)  is  instrumental  in  acquiring  other  resources  (e.g.,  knowledge). 
Individual  agents  are  endowed  with  a  limited  amount  of  the  former  (e.g.,  hours  in  a  day).  This 
limitation  imposes  choices  on  the  agents  about  how  to  allocate  the  instrumental  resource.  In  a 
model  with  time  as  the  limited  resource,  for  example,  the  choices  to  be  made  by  the  agents  may 
concern  which  interests  to  cultivate.  Given  the  set  of  individual  choices,  the  network  among  agents 
emerges  as  a  consequence  of  a  simple  social  aggregation  process  that  induces  ties  between  pairs  of 
agents  with  similar  interests. 

Algorithm A3

1. for $n \in \mathcal{N}$
2.    $p_n \sim \text{logit MVN}(0, \Sigma(\alpha))$
3.    for $k = 1, \ldots, K$
4.       $b_n(k) \sim \text{Bernoulli}(p_n(k))$
5. for $n, m \in \mathcal{N}$
6.    $Y(n, m) \sim \text{Dirac}(f(b_n, b_m) > \tau)$

The constants underlying Algorithm A3 are $\theta = (K, f, \alpha, \tau)$, where $K$ is the number of node-specific attributes, $b_n(k)$; $f(\cdot, \cdot)$ is a measure of similarity; $\tau$ is the $f$-similarity threshold beyond which two nodes are set as connected; and $\alpha$ is the exclusivity parameter that I shall now discuss.

Algorithm A3 is fairly similar to the algorithm I used to generate lognormal graphs. The $N$ agents, i.e., the nodes, are associated with binary strings, $b_{1:N}$, whose components indicate the presence or absence of a specific interest, out of $K$ possible. There are two changes here: (i) the




CHAPTER  2.  RANDOM  GRAPHS  REVISITED 


E.M. AIROLDI


Figure  2.6:  Example  interest  vectors  for  25  nodes. 


fact that the elements $b_n(k)$ are sampled with agent-specific probabilities; and (ii) the agent-specific probability vectors, $p_n$, have weakly dependent components; this is crucial. Specifically, the variance-covariance matrix $\Sigma = \Sigma(\alpha)$ has entries

\[ \sigma_{nn} = \frac{\alpha^2 (K - 1)}{(K\alpha)^2 (K\alpha + 1)}, \qquad \text{and} \qquad \sigma_{nm} = \frac{-\alpha^2}{(K\alpha)^2 (K\alpha + 1)}, \tag{2.13} \]


specified in terms of the scalar parameter $\alpha$. The moments of the multivariate normal distribution in step 2 of Algorithm A3 are reminiscent of those of a Dirichlet distribution. The main difference between the multivariate normal and the Dirichlet is that the support of the former is the unit hyper-cube after a logistic transformation of its coordinates,

\[ p_n(k) = \frac{\exp\{z_k\}}{1 + \sum_{j=1}^{K} \exp\{z_j\}}, \qquad \text{for } k = 1, \ldots, K, \]

whereas the support of the latter is the $K$-dimensional simplex. The variance-covariance structure $\Sigma(\alpha)$ enforces what I term exclusivity of interests in the following sense: by dedicating time to acquire a specific interest, agents implicitly choose not to pursue other interests. The parameter $\alpha$ has support in $(0, \infty)$; the closer $\alpha$ is to zero, the stronger it promotes exclusivity. Furthermore, $\alpha < 1$ promotes exclusivity, whereas $\alpha > 1$ implies that agents are likely to devote an equal amount of time to each one of the $K$ available interests.

Figure 2.7: The matrix $Y_{(100 \times 100)}$ of a cellular graph sampled via Algorithm A3.
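The steps of Algorithm A3 can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the figures: the covariance uses Dirichlet-like moments as one reading of Equation 2.13, the similarity $f$ is Pearson's correlation between interest vectors, pairs whose correlation is undefined (a constant interest vector) are treated as unconnected, and the names `sigma_alpha` and `sample_cellular_graph` are hypothetical.

```python
import numpy as np

def sigma_alpha(K, alpha):
    """Covariance with Dirichlet-like moments (assumed reading of Equation 2.13)."""
    denom = (K * alpha) ** 2 * (K * alpha + 1)
    cov = np.full((K, K), -alpha ** 2 / denom)
    np.fill_diagonal(cov, alpha ** 2 * (K - 1) / denom)
    return cov

def sample_cellular_graph(N, K, alpha, tau, seed=0):
    """Sample interest vectors b and an adjacency matrix Y following Algorithm A3."""
    rng = np.random.default_rng(seed)
    # Step 2: z_n ~ MVN(0, Sigma(alpha)), drawn through an eigendecomposition
    # because this covariance matrix is singular.
    vals, vecs = np.linalg.eigh(sigma_alpha(K, alpha))
    root = vecs * np.sqrt(np.clip(vals, 0.0, None))
    z = rng.standard_normal((N, K)) @ root.T
    # Logistic transformation of the coordinates, as written in the text.
    p = np.exp(z) / (1.0 + np.exp(z).sum(axis=1, keepdims=True))
    # Steps 3-4: binary interest vectors with agent-specific probabilities.
    b = (rng.random((N, K)) < p).astype(float)
    # Steps 5-6: connect n and m when f(b_n, b_m) exceeds tau.
    with np.errstate(invalid="ignore", divide="ignore"):
        f = np.corrcoef(b)
    Y = np.nan_to_num(f, nan=-1.0) > tau     # undefined similarity means no edge
    np.fill_diagonal(Y, False)
    return b, Y.astype(int)

b, Y = sample_cellular_graph(N=30, K=5, alpha=0.5, tau=0.3, seed=2)
```

Setting `tau` from a percentile of the off-diagonal similarities, as done for Figure 2.7, only requires computing `f` once and passing the chosen percentile as the threshold.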

Figure 2.6 shows an example of interest vectors, $b_n$, for 25 nodes, in a simulation where the number of interests, $K$, is five. Figure 2.7 shows one of the cellular graphs generated by Algorithm A3. The parameters were: 100 nodes, $K = 5$, $f$ is Pearson’s correlation coefficient, and $\tau$ is the 15-th percentile of the observed associations. Figure 2.8 shows an aspect of the emergence of



Figure  2.8:  The  clustering  coefficient  as  a  function  of  a  obtained  with  Algorithm  A3. 


communities induced by Algorithm A3. In particular, a sharp drop in the clustering coefficient occurs as the exclusivity parameter $\alpha$ increases from $\approx 0$ towards 1; when $\alpha > 1$ the community structure is no longer present.

By  using  Algorithm  A3  I  show  how  communities  form  because  of  the  joint  effect  of  two  simple 
factors:  (i)  exclusivity,  that  is,  the  need  for  allocating  resources  to  competing  interests — which 
may  be  induced  by  the  finiteness  of  such  resources;  and  (ii)  homophily,  that  is,  the  fact  that  social 
interactions  are  more  likely  to  occur  between  individuals  who  share  interests  than  between  those 
who do not. Communities emerge, quickly, as $\alpha \to 0$ within the regime $\alpha < 1$.

2.3 Convex Generation of Graphs with Degree Constraints

Here I express the problem of finding the most likely graph with a given, arbitrary degree distribution as a convex optimization problem.

There are many situations where being able to sample graphs with a given arbitrary degree distribution is crucial. One of the main issues with the existing algorithms is their failure rate, that is, the number of times such algorithms have to be restarted in order to produce a graph that satisfies the given constraint on the degree distribution (Milo et al., 2004c). Expressing the problem of finding likely graphs that satisfy the constraint on the degree distribution as a convex optimization problem will resolve this issue. In fact, once we have a starting point in a high probability region of the space of feasible graphs, sampling graphs at random that satisfy a given constraint is easier and leads to a much lower failure rate.

Consider an undirected, unipartite graph, $G = (Y, \mathcal{N}) \in \mathcal{G}$. A basis for $\mathcal{G}$ is given by the collection of graphs $E_{nm} = (Y_{nm}, \mathcal{N})$, indexed by pairs of nodes $(n, m)$ in $\mathcal{N}$. The element $(i, j)$ of the adjacency matrix of $E_{nm}$ is defined as,

\[ Y_{nm}(i, j) = \begin{cases} 1 & \text{if } (i = n,\ j = m) \text{ or } (i = m,\ j = n) \\ 0 & \text{otherwise.} \end{cases} \]

It is then possible to write the problem of finding a graph, $Y$, with a pre-specified degree sequence, $d$, as follows:

\[ Y = \sum_{e \in \mathcal{E}_G} \alpha_e E_e \qquad \text{s.t. } Y \cdot \mathbf{1} = d \text{ and } \alpha_e \in \{0, 1\} \text{ for all } e, \tag{2.14} \]

where $\mathcal{E}_G$ is the set of all possible edges among pairs of nodes of $G$. Simulated annealing, for example, can be used to find the unique solution. Problem 2.14 is formulated as a convex optimization (Boyd and Vandenberghe, 2004).
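To make the ingredients of Problem 2.14 concrete, the sketch below builds a graph as a 0/1 combination of basis graphs and checks the degree constraint $Y \cdot \mathbf{1} = d$. It only verifies feasibility of a candidate; it is not a solver, and the helper names (`basis_graph`, `graph_from_coefficients`, `satisfies_degrees`) are hypothetical.

```python
import numpy as np

def basis_graph(n, m, size):
    """Adjacency matrix of the basis graph E_nm: a single undirected edge (n, m)."""
    E = np.zeros((size, size), dtype=int)
    E[n, m] = E[m, n] = 1
    return E

def graph_from_coefficients(alpha, size):
    """Combine basis graphs with coefficients alpha[(n, m)] in {0, 1}."""
    Y = np.zeros((size, size), dtype=int)
    for (n, m), a in alpha.items():
        Y += a * basis_graph(n, m, size)
    return Y

def satisfies_degrees(Y, d):
    """Check the linear constraint Y . 1 = d."""
    return bool(np.array_equal(Y @ np.ones(len(d), dtype=int), np.asarray(d)))

# A 4-cycle on nodes 0..3: every node has degree 2.
alpha = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (0, 3): 1, (0, 2): 0, (1, 3): 0}
Y = graph_from_coefficients(alpha, 4)
```

A search procedure such as simulated annealing would explore assignments of `alpha` and accept only those for which `satisfies_degrees` holds.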


*  *  * 

In  this  chapter,  I  introduced  exchangeable-edge  models  of  graphs.  I  argue  that: 

•  they  represent  an  important  extension  of  the  popular  random  graph  model  of  Erdos  and  Renyi 
(1959)  and  Gilbert  (1959),  technically,  by  involving  a  layer  of  latent  variables; 

•  they are proper statistical (Bayesian) models, in the sense that we can write down and evaluate their likelihood, and can therefore be used for principled analyses of data.

To substantiate these claims, I presented a novel analysis of lognormal and cellular networks based upon them. Methodology for statistical network analysis is presented in Chapter 3 as well. The development of exchangeable-edge models has led me to the useful formulation of the problem of sampling graphs with given degree constraints as a convex optimization problem.

Chapter 3


Discovering  Latent  Patterns 


In this chapter, I work within the general formulation of exchangeable-edge models of Section 2.2 to specify Bayesian mixed-membership models of random graphs that are used to discover latent patterns.


Introduction and Motivation  Statistical models are used to inform scientific analyses of graphs and networks that encode observations about phenomena of interest (Holland and Leinhardt, 1975; Wasserman, 1980; Fienberg et al., 1985; Wasserman and Pattison, 1996; Watts and Strogatz, 1998; Cooper and Frieze, 2003; Kemp et al., 2006). Often, we can specify models via probabilistic algorithms that generate nodes and/or edges in a hierarchical fashion, starting from a small set of underlying constants. Specifications of hierarchical dependencies among such constants, other non-observable quantities possibly generated in intermediate steps, and the data provide a channel to inform the analysis with structural assumptions that are relevant for a specific application. For instance, in social network analysis the social context in which actors interact, or the group of which actors are members, are examples of such non-observable (or partially observable) quantities that may be useful to explain, e.g., email communications among a group of employees within a company, or interactions among a set of proteins in a certain tissue under specific experimental conditions.

The three main approaches proposed in the social, mathematical and computing sciences literatures that make use of non-observable quantities$^{1}$ to express specific concepts relevant to an application domain are (i) latent space models, (ii) block models, and (iii) diffusion models. In more detail, in (i) the $N$ nodes in the graph are projected onto a latent space, $\Theta$, in a way that edges are preserved whenever the distance of their projections is high enough, e.g., $d(\theta_n, \theta_m)$ exceeds a threshold. In (ii) the graph is summarized in terms of a noisy block structure, $B$, along with the memberships of nodes to blocks, $\pi_{1:N}$, as detailed in Section 3.1. In (iii) mathematical functionals defined on the graph are proposed to study the diffusion process of real or informational artifacts among the nodes. I will focus on extending block models for a major portion of this chapter. I will then revisit latent space models and diffusion models towards the end of the chapter.


3.1  Admixture  of  Latent  Blocks  Model 

Relational information arises in a variety of settings, e.g., in scientific literature papers are connected by citation, in the world wide web the webpages are connected by hyperlinks, and in cellular systems the proteins are often related by physical protein-protein interactions revealed in yeast two-hybrid experiments. These types of relational data violate the assumptions of independence or exchangeability of objects adopted in many conventional analyses. In fact, the relationships themselves between objects are often of interest in addition to the object attributes. For example, one may be interested in predicting the citations of newly written papers or the likely links of a web-page, or in clustering cellular proteins based on patterns of interactions between them.

1 A mainstream approach to statistical network analysis that (for the most part) does not make use of latent variables is presented in the book by Wasserman and Faust (1994).

In many such applications, clustering the objects of study or projecting them in a low dimensional space (e.g., a simplex) is only one of the goals of the analysis. Being able to estimate the relational structures among the clusters themselves is often as important as object clustering. For example, from observations about email communications of a study population, one may be interested not only in identifying groups of people with common characteristics or social states, but also, at the same time, in exploring how the overall communication volume or pattern among these groups can reveal the organizational structures of the population. Furthermore, in the modern network analysis tasks described above, it is desirable to also relax the unary-aspect assumption on each node imposed by extant models. To this end, I introduce a new class of models based on the principle of stochastic block models of mixed membership, which combines features of the mixed-membership models (Erosheva and Fienberg, 2005) and the block models (Holland et al., 1983; Anderson et al., 1992; Nowicki and Snijders, 2001; Doreian et al., 2004) via a hierarchical Bayesian framework, and offers a flexible machinery to capture rich semantic aspects of various network data; see Section 4.2.2 for a general formulation.

Below, I describe an instantiation of this class of models, referred to as admixture of latent blocks (ALB) for reasons to be explained shortly, for analyzing networks of objects with multiple latent roles, e.g., social activities in case the objects refer to people (Airoldi et al., 2007b), or biological functions in case the objects refer to proteins (Airoldi et al., 2006c). As mentioned above, classical network models such as the stochastic block models only allow each node to bear a single role. Our model alleviates this constraint, and furthermore posits that each node can adopt different roles when interacting with different other nodes. In Section 4.2 of Chapter 4, I will describe the general model formulation for multivariate relations along with the general model formulation for multivariate attributes.

Historical Notes  A popular class of probabilistic models for relational data analysis is based on the stochastic block model (SBM) formalism for psychometric and sociological analysis pioneered by Holland and Leinhardt (1975), and later extended in various contexts (Fienberg et al., 1985; Wasserman and Pattison, 1996; Snijders, 2002; Hoff et al., 2002; Doreian et al., 2004). In machine learning, Markov random networks have been used for link prediction (Taskar et al., 2003) and the traditional block models have been extended to include nonparametric Bayesian priors (Kemp et al., 2004, 2006) and to integrate relations and text (McCallum et al., 2007). Typically, these models posit that every node in a study network is characterized by a unary latent aspect that accounts for its interaction patterns with peers in the network; conditioning on the observed network topology, one can reason about these latent aspects of nodes via posterior inference. These formulations are closely related to the one introduced here.

Largely disjoint from the network analysis literature, methodologies for latent aspect modeling have also been widely investigated in the contexts of different information retrieval problems concerning the modeling of high-dimensional non-relational attributes such as text content or genetic-allele profiles. In many of these domains, variants of a mixed membership formalism have been proposed to capture a more realistic assumption about the observed attributes, namely that the observations result from contributions from multiple latent aspects rather than a unary aspect as assumed in most extant network models such as SBM. The mixed membership models have emerged as a powerful and popular analytical tool for analyzing large databases involving text (Blei et al., 2003), text and references (Cohn and Hofmann, 2001; Erosheva et al., 2004), text and images (Barnard et al., 2003), multiple disability measures (Erosheva and Fienberg, 2005; Manton et al., 1994), and genetics information (Rosenberg et al., 2002; Pritchard et al., 2000; Xing et al., 2003c). These models often employ a simple generative model, such as a bag-of-words model or a naive Bayes, embedded in a hierarchical Bayesian framework involving a latent variable structure that combines multiple latent aspects. This scheme induces dependencies among the objects’ relational behaviors in the form of probabilistic constraints over the estimation of what might otherwise be an extremely large set of parameters.

3.1.1 Goals of the Analysis

I am concerned with modeling data represented as a collection of directed unipartite graphs. A unipartite graph is a graph whose nodes are of a single type, e.g., individual human beings in case of a person-to-person communication network, as opposed to bipartite and multipartite graphs, where the nodes are of two or multiple types, e.g., genes-to-experiments (Blei et al., 2003; Airoldi et al., 2006f) or employees-to-tasks-to-resources (Carley, 2002). Consider a collection of unipartite graphs whose edges encode measurements on pairs of nodes about a response variable. Multiple graphs encode replicates, $M$, of the same relation. Denote the collection of graphs by $\mathcal{G} = \{G_m : m = 1, \ldots, M\}$, where each graph, $G_m$, is defined over a common set of nodes, $\mathcal{N}$. The random variables that encode edge weights are denoted by $R_m(p, q)$, where $(p, q)$ is a pair of nodes in $\mathcal{N}$.

Example 17. Sampson (1968) described a collection of relationships measured among a group of monks in a monastery. He observed responses about typically asymmetric relations such as “Do you like monk X?”, at a sequence of epochs. This information is representable as a collection of $M$ graphs, where the edges encode the binary “like” responses.

Example 18. Mewes et al. (2004) describe the set of hand-curated protein interactions produced by the Munich Information Center for Protein Sequences (MIPS). A single set of interactions between proteins has been experimentally verified. This information is representable as a single graph where the random variables associated with the edges are binary. See the reanalyses in Airoldi et al. (2006c) for further details.

The analysis of such data typically focuses on the following objectives: (1) identifying clusters of nodes; (2) determining the number of clusters; and (3) estimating the probability distribution of interactions among actors within and between clusters. For instance, in the monastery social network of Example 17, objective 1 translates to identifying the solid factions among monks. In addition one wants to determine how many factions are likely to exist in the monastery, and how the factions relate to one another. Typically, unsupervised learning experiments are performed, or semi-supervised learning experiments with minimal information available in terms of membership of, say, monks to factions. Working in the hierarchical Bayes framework, we can either specify the constants underlying the distribution of random quantities at the top level of the hierarchy (i.e., the hyper-parameters) or estimate them via empirical Bayes methods. This methodology accommodates hypothesis testing about the existence of specific relational structure among clusters.

Figure 3.1: The scientific problem at a glance. The goal of the analysis is to make inference on two mappings: nodes-to-clusters (via $\pi_{1:N}$) and clusters-to-clusters (via $B$). The facts that $B$ does not necessarily encode a tree, and that $\pi_{1:N}$ is not necessarily one-to-one, distinguish our formulation from typical hierarchical and hard clustering. (Diagram labels: clusters and mappings are latent; nodes and relations are observable.)


3.1.2  Model  Specifications 

The approach detailed below employs a hierarchical Bayesian formalism that encodes statistical assumptions underlying a network generative process. This process generates the observed networks according to the latent distribution of the hypothetical group-involvement of each monk, as specified by a mixed-membership multinomial vector $\pi := [\pi_1, \ldots, \pi_K]'$, where $\pi_i$ denotes the probability of a monk belonging to group $i$, and the probabilities of having interactions between different groups, as defined by a matrix of Bernoulli rates $B_{(K \times K)} = \{B_{ij}\}$, where $B_{ij}$ represents the probability of having a link between a monk from group $i$ and a monk from group $j$. Each monk is associated with a unique $\pi$, meaning that he can simultaneously belong to multiple groups, and the degree of involvement in different groups is unique for each monk; the vectors $\pi$ of different monks independently follow a Dirichlet distribution parameterized by $\alpha$.

More generally, for graph $m$ and each node, let the indicator vector$^{2}$ $z^m_{p \to q}$ denote the group membership of node $p$ when it approaches node $q$; let $z^m_{p \leftarrow q}$ denote the group membership of node $q$ when it is approached by node $p$; let $N := |\mathcal{N}|$ denote the number of nodes in the graph; and let $K$ denote the number of distinct groups a node can belong to. An admixture of latent blocks (ALB) model posits that a sequence of $M$ networks, $G_{1:M} = (R_{1:M}, \mathcal{N})$, can be instantiated according to the following procedure:

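Algorithm A1: $(\mathcal{N}, M, K, \alpha, B) \to R_{1:M}$.

1. For each node $p \in \mathcal{N}$
   1.1. Sample $\pi_p \sim \text{Dirichlet}(\alpha)$.
2. For each interaction network $m = 1, \ldots, M$
   2.1. For each pair of nodes $(p, q) \in \mathcal{N} \otimes \mathcal{N}$
      2.1.1. Sample group $z^m_{p \to q} \sim \text{Multinomial}(\pi_p, 1)$
      2.1.2. Sample group $z^m_{p \leftarrow q} \sim \text{Multinomial}(\pi_q, 1)$
      2.1.3. Sample $R_m(p, q) \sim \text{Bernoulli}({z^m_{p \to q}}' B \, z^m_{p \leftarrow q})$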
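The generative procedure above can be sketched directly. In this illustration the indicator vectors are represented by integer group labels `g` and `h`, so that the bilinear form in step 2.1.3 reduces to the entry `B[g, h]`; the function name `sample_alb` and the example values of `alpha` and `B` are assumptions, not values from the analyses.

```python
import numpy as np

def sample_alb(N, M, alpha, B, seed=0):
    """Sample mixed-membership vectors pi and M binary networks R from the ALB model."""
    rng = np.random.default_rng(seed)
    K = len(alpha)
    pi = rng.dirichlet(alpha, size=N)                 # step 1.1: one row per node
    R = np.zeros((M, N, N), dtype=int)
    for m in range(M):                                # step 2
        for p in range(N):                            # step 2.1: all ordered pairs
            for q in range(N):
                g = rng.choice(K, p=pi[p])            # step 2.1.1: role of p towards q
                h = rng.choice(K, p=pi[q])            # step 2.1.2: role of q towards p
                R[m, p, q] = rng.random() < B[g, h]   # step 2.1.3: Bernoulli(B[g, h])
    return pi, R

# Assumed toy setting: two groups with strong within-group interaction.
B = np.array([[0.9, 0.1], [0.1, 0.9]])
pi, R = sample_alb(N=6, M=2, alpha=np.array([0.5, 0.5]), B=B, seed=1)
```

Because the roles are redrawn for every ordered pair, the same node can act as a member of different groups in different interactions, which is the context dependence discussed next.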

It is noteworthy that in the above model, the group membership of each node is context dependent, that is, each node can assume a different membership when interacting with, or being approached by, different peers. Therefore, each node is statistically an admixture of group-specific interactions, and I denote the two sets of latent group indicators corresponding to the $m$-th observed network by $\{z^m_{p \to q} : p, q \in \mathcal{N}\} =: Z^{\to}_m$ and $\{z^m_{p \leftarrow q} : p, q \in \mathcal{N}\} =: Z^{\leftarrow}_m$. Marginalizing out the latent group indicators, it is easy to show that the probability of observing an interaction between node $p$ and $q$ across the $M$ networks is $\sigma_{pq} = \pi_p' B \, \pi_q$.

2 An indicator vector of memberships in one of the $K$ groups is defined as a $K$-dimensional vector in which only the element whose index corresponds to the id of the group to be indicated equals one, and all other elements equal zero.
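The marginalization can be spelled out in one line. The derivation below uses only the sampling steps of the generative procedure, namely that the two indicators are drawn independently given $\pi_p$ and $\pi_q$; the symbol $e_g$, denoting the indicator vector of group $g$, is notation introduced here for convenience.

```latex
\begin{align*}
\sigma_{pq}
  &= \mathbb{E}\!\left[ z_{p \to q}' \, B \, z_{p \leftarrow q} \mid \pi_p, \pi_q \right] \\
  &= \sum_{g=1}^{K} \sum_{h=1}^{K}
     \Pr(z_{p \to q} = e_g \mid \pi_p) \, B(g, h) \, \Pr(z_{p \leftarrow q} = e_h \mid \pi_q) \\
  &= \sum_{g=1}^{K} \sum_{h=1}^{K} \pi_p(g) \, B(g, h) \, \pi_q(h)
   = \pi_p' \, B \, \pi_q .
\end{align*}
```

The same expression holds for every replicate $m$, since the indicators are redrawn independently for each network.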

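Under the ALB model outlined above, the joint probability distribution of the data, $R_{1:M}$, and the latent variables $(\pi_{1:N}, Z^{\to}_{1:M}, Z^{\leftarrow}_{1:M})$ can be written in the following factored form:

\begin{align}
p(R_{1:M} \mid \alpha, B)
  &= \int_{\Pi \otimes Z} p(R_{1:M}, \pi_{1:N}, Z^{\to}_{1:M}, Z^{\leftarrow}_{1:M} \mid \alpha, B) \, d\pi \, dZ \nonumber \\
  &= \int_{\Pi \otimes Z} \Big( \prod_{m} \prod_{p,q} p_1(R_m(p, q) \mid z^m_{p \to q}, z^m_{p \leftarrow q}, B) \; p_2(z^m_{p \to q} \mid \pi_p, 1) \nonumber \\
  &\qquad \times \, p_2(z^m_{p \leftarrow q} \mid \pi_q, 1) \Big) \prod_{p} p_3(\pi_p \mid \alpha) \, d\pi \, dZ \tag{3.1}
\end{align}

where $p_1$ is Bernoulli, $p_2$ is multinomial, and $p_3$ is Dirichlet.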

To compute the likelihood of the observed networks, one needs to marginalize out the hidden variables $\pi$ and $Z$ for all nodes, which is intractable even for small graphs. In Section 3.1.3, I describe a variational scheme to approximate this likelihood for parameter estimation.


Dealing with Sparsity  Most networks in the real world are sparse, meaning that most pairs of nodes do not have edges connecting them. But in many network analyses, observations about interactions and non-interactions are equally important in terms of their contributions to model fit. In other words, they would compete for a statistical explanation in terms of estimates for parameters $(\alpha, B)$, and would both influence the distribution of latent variables such as $\pi_{1:N}$. An undesirable consequence of this, in scenarios where interactions are rare, is that parameter estimation and posterior inference would explain patterns of non-interaction rather than patterns of interaction.

In order to calibrate the importance of rare interactions, we introduce the sparsity parameter
$\rho \in [0, 1]$, which models how often a non-interaction is due to measurement noise (which
is common in certain experimentally derived networks such as protein-protein interaction networks)
and how often it carries information about the group memberships of the nodes. This leads
to a small extension of the generative process outlined in the last subsection. Specifically, instead
of drawing an edge directly from a Bernoulli with rate $(z^m_{p\to q})^\top B\, z^m_{p\leftarrow q}$, we now sample an interaction
with probability $\sigma^m_{pq} = (1 - \rho) \cdot (z^m_{p\to q})^\top B\, z^m_{p\leftarrow q}$; therefore the probability of having no interaction between this
pair of nodes is $1 - \sigma^m_{pq} = (1 - \rho) \cdot (z^m_{p\to q})^\top (1 - B)\, z^m_{p\leftarrow q} + \rho$. This is equivalent to re-parameterizing
the interaction matrix B. During estimation and inference, a large value of $\rho$ would cause the
interactions in the matrix to be weighted more than non-interactions in determining the estimates of
$(\alpha, B, \pi_{1:N})$.
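The sparsity-adjusted generative step can be illustrated with a minimal sketch for a single ordered pair of nodes. The function names, the list-of-lists representation of B, and the numeric values used for checking are assumptions made for illustration; they are not part of the original text.

```python
import random

def edge_probability(pi_p, pi_q, B, rho):
    """Marginal edge probability (1 - rho) * pi_p' B pi_q, indicators summed out."""
    K = len(pi_p)
    bilinear = sum(pi_p[g] * B[g][h] * pi_q[h] for g in range(K) for h in range(K))
    return (1.0 - rho) * bilinear

def sample_edge(pi_p, pi_q, B, rho, rng):
    """Draw both group indicators, then the edge with rate (1 - rho) * B[g][h]."""
    K = len(pi_p)
    g = rng.choices(range(K), weights=pi_p, k=1)[0]  # sender indicator z_{p->q}
    h = rng.choices(range(K), weights=pi_q, k=1)[0]  # receiver indicator z_{p<-q}
    return 1 if rng.random() < (1.0 - rho) * B[g][h] else 0
```

With rho = 0 the sketch reduces to the process without the sparsity parameter; with rho = 1 no interaction is ever generated.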

3.1.3  Estimation  and  Inference 

I use an empirical Bayes framework for estimating the parameters $(\alpha, B)$, and employ a mean-field
approximation scheme (Jordan et al., 1999) for posterior inference of the (latent) mixed-membership
vectors, $\pi_{1:N}$. Model selection can be performed to determine a plausible value
of K, the number of groups of nodes, based on a strategy described in Airoldi et al. (2006e).

In order to estimate $(\alpha, B)$ and infer the posterior distributions of $\pi_{1:N}$ we need to be able to
evaluate the likelihood, which involves the intractable integral over $Z$ and $\pi_{1:N}$ in Equation 3.1.
Given the large amount of data available for most networks, we focus on approximate posterior
inference strategies in the context of variational methods, and we find a tractable lower bound for
the likelihood that can be used as a surrogate for inference purposes. This leads to approximate
MLEs for the hyper-parameters and approximate posterior distributions for the (latent) mixed-membership
vectors.

Variational Expectation-Maximization  The approximate variant of EM I describe here is often
referred to as Variational EM (Beal and Ghahramani, 2003; Blei et al., 2003). Begin by rewriting
$Y = R_{1:M}$ for the data, $X = (\pi_{1:N}, Z^{\to}_{1:M}, Z^{\leftarrow}_{1:M})$ for the latent variables, and $\Theta = (\alpha, B)$ for the
model's parameters. Briefly, it is possible to lower bound the log-likelihood, $\log p(Y \mid \Theta)$, making use of
Jensen's inequality and of any distribution on the latent variables $q(X)$,

$$
\begin{aligned}
\log p(Y \mid \Theta) &= \log \int_X p(Y, X \mid \Theta)\, dX \\
&= \log \int_X q(X)\, \frac{p(Y, X \mid \Theta)}{q(X)}\, dX \qquad \text{(for any } q\text{)} \\
&\geq \int_X q(X) \log \frac{p(Y, X \mid \Theta)}{q(X)}\, dX \qquad \text{(Jensen's)} \\
&= E_q\big[ \log p(Y, X \mid \Theta) - \log q(X) \big] =: \mathcal{L}(q, \Theta) \qquad (3.2)
\end{aligned}
$$

In EM, the lower bound $\mathcal{L}(q, \Theta)$ is then iteratively maximized with respect to $\Theta$, in the M step, and
$q$ in the E step (Dempster et al., 1977). In particular, at the t-th iteration of the E step we set

$$q^{(t)} = p(X \mid Y, \Theta^{(t-1)}), \qquad (3.3)$$

that is, equal to the posterior distribution of the latent variables given the data and the estimates of
the parameters at the previous iteration.
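A small numeric check may help to see why setting q to the posterior makes the bound tight. The sketch below uses a hypothetical discrete latent variable with three states and made-up joint probabilities; none of these values come from the original analysis.

```python
import math

def lower_bound(joint, q):
    """L(q) = sum_x q(x) * [log p(Y, x) - log q(x)] for a discrete latent x."""
    return sum(qx * (math.log(px) - math.log(qx))
               for px, qx in zip(joint, q) if qx > 0)

joint = [0.10, 0.25, 0.05]                # hypothetical values of p(Y, X = x)
log_lik = math.log(sum(joint))            # log p(Y), the quantity being bounded
posterior = [px / sum(joint) for px in joint]
uniform = [1.0 / 3.0] * 3

gap_uniform = log_lik - lower_bound(joint, uniform)      # equals KL(q || posterior)
gap_posterior = log_lik - lower_bound(joint, posterior)  # zero up to rounding
```

The gap between the log-likelihood and the bound is the Kullback-Leibler divergence from q to the posterior, which is why the E step in Equation 3.3 is optimal whenever the posterior is available.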

Unfortunately, the posterior in Equation 3.3 for the admixture of latent blocks model cannot
be computed. Rather, a direct parametric approximation to it needs to be defined, $\tilde q = q_\Delta(X)$,
which involves an extra set of variational parameters, $\Delta$, and entails an approximate lower bound
for the likelihood, $\mathcal{L}_\Delta(q, \Theta)$. At the t-th iteration of the E step, the Kullback-Leibler divergence
between $q^{(t)}$ and $q^{(t)}_\Delta$ is then minimized with respect to $\Delta$, using the data.3 The optimal parametric
approximation is, in fact, a proper posterior as it depends on the data Y, although indirectly, $q^{(t)} \approx
q_{\Delta^*(Y)}(X) = \tilde p(X \mid Y)$.


3This is equivalent to maximizing the approximate lower bound for the likelihood, $\mathcal{L}_\Delta(q, \Theta)$, with respect to $\Delta$.
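Lower Bound for the Likelihood  According to the mean-field theory (Jordan et al., 1999; Xing et al.,
2003b), one can approximate an intractable distribution such as the one defined by Equation 3.1
by a fully factored distribution $q(\pi_{1:N}, Z^{\to}_{1:M}, Z^{\leftarrow}_{1:M})$ defined as follows:

$$
q(\pi_{1:N}, Z^{\to}_{1:M}, Z^{\leftarrow}_{1:M} \mid \gamma_{1:N}, \Phi^{\to}_{1:M}, \Phi^{\leftarrow}_{1:M})
= \prod_p q_1(\pi_p \mid \gamma_p) \prod_m \prod_{p,q} \Big( q_2(z^m_{p\to q} \mid \phi^m_{p\to q}, 1)\; q_2(z^m_{p\leftarrow q} \mid \phi^m_{p\leftarrow q}, 1) \Big),
$$

where $q_1$ is a Dirichlet, $q_2$ is a multinomial, and $\Delta = (\gamma_{1:N}, \Phi^{\to}_{1:M}, \Phi^{\leftarrow}_{1:M})$ represents the set of free
variational parameters that need to be estimated in the approximate distribution.

Minimizing the Kullback-Leibler divergence between this $q(\pi_{1:N}, Z^{\to}_{1:M}, Z^{\leftarrow}_{1:M} \mid \Delta)$ and the original
$p(\pi_{1:N}, Z^{\to}_{1:M}, Z^{\leftarrow}_{1:M})$ defined by Equation 3.1 leads to the following approximate lower bound
for the likelihood.

$$
\begin{aligned}
\mathcal{L}_\Delta(q, \Theta) =\;& E_q\Big[ \log \prod_m \prod_{p,q} p_1\big(R_m(p,q) \mid z^m_{p\to q}, z^m_{p\leftarrow q}, B\big) \Big] \\
&+ E_q\Big[ \log \prod_m \prod_{p,q} p_2(z^m_{p\to q} \mid \pi_p, 1) \Big] + E_q\Big[ \log \prod_m \prod_{p,q} p_2(z^m_{p\leftarrow q} \mid \pi_q, 1) \Big] \\
&+ E_q\Big[ \log \prod_p p_3(\pi_p \mid \alpha) \Big] - E_q\Big[ \log \prod_p q_1(\pi_p \mid \gamma_p) \Big] \\
&- E_q\Big[ \log \prod_m \prod_{p,q} q_2(z^m_{p\to q} \mid \phi^m_{p\to q}, 1) \Big] - E_q\Big[ \log \prod_m \prod_{p,q} q_2(z^m_{p\leftarrow q} \mid \phi^m_{p\leftarrow q}, 1) \Big].
\end{aligned}
$$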
m  p,q  m  p,q 
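Working on the single expectations leads to the following expression,

$$
\begin{aligned}
\mathcal{L}_\Delta(q, \Theta) =\;& \sum_m \sum_{p,q} \sum_{g,h} \phi^m_{p\to q,g}\, \phi^m_{p\leftarrow q,h} \cdot f\big( R_m(p,q), B(g,h) \big) \\
&+ \sum_m \sum_{p,q} \sum_g \phi^m_{p\to q,g} \Big[ \psi(\gamma_{p,g}) - \psi\big(\textstyle\sum_g \gamma_{p,g}\big) \Big] \\
&+ \sum_m \sum_{p,q} \sum_h \phi^m_{p\leftarrow q,h} \Big[ \psi(\gamma_{q,h}) - \psi\big(\textstyle\sum_h \gamma_{q,h}\big) \Big] \\
&+ \sum_p \log \Gamma\big(\textstyle\sum_k \alpha_k\big) - \sum_{p,k} \log \Gamma(\alpha_k) + \sum_{p,k} (\alpha_k - 1) \Big[ \psi(\gamma_{p,k}) - \psi\big(\textstyle\sum_k \gamma_{p,k}\big) \Big] \\
&- \sum_p \log \Gamma\big(\textstyle\sum_k \gamma_{p,k}\big) + \sum_{p,k} \log \Gamma(\gamma_{p,k}) - \sum_{p,k} (\gamma_{p,k} - 1) \Big[ \psi(\gamma_{p,k}) - \psi\big(\textstyle\sum_k \gamma_{p,k}\big) \Big] \\
&- \sum_m \sum_{p,q} \sum_g \phi^m_{p\to q,g} \log \phi^m_{p\to q,g} - \sum_m \sum_{p,q} \sum_h \phi^m_{p\leftarrow q,h} \log \phi^m_{p\leftarrow q,h},
\end{aligned}
$$

where

$$f\big( R_m(p,q), B(g,h) \big) = R_m(p,q) \log B(g,h) + \big( 1 - R_m(p,q) \big) \log \big( 1 - B(g,h) \big);$$

m runs over 1, ..., M; p, q run over 1, ..., N; g, h, k run over 1, ..., K; and $\psi(x)$ is the derivative
of the log-gamma function, $d \log \Gamma(x) / dx$.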


The Expected Value of the Log of a Dirichlet Random Vector  The computation of the lower
bound for the likelihood requires us to evaluate $E_q[\log \pi_p]$ for p = 1, ..., N. Recall that the
density of an exponential family distribution with natural parameter $\theta$ can be written as

$$
\begin{aligned}
p(x \mid \alpha) &= h(x) \cdot c(\alpha)^{-1} \cdot \exp \Big\{ \sum_k \theta_k(\alpha) \cdot t_k(x) \Big\} \\
&= h(x) \cdot \exp \Big\{ \sum_k \theta_k(\alpha) \cdot t_k(x) - \log c(\alpha) \Big\}.
\end{aligned}
$$
k 
Omitting the node index p for convenience, the density of the Dirichlet distribution $p_3$ can be
rewritten as an exponential family distribution,

$$p_3(\pi \mid \alpha) = \exp \Big\{ \sum_k (\alpha_k - 1) \log(\pi_k) - \log \frac{\prod_k \Gamma(\alpha_k)}{\Gamma(\sum_k \alpha_k)} \Big\},$$

with natural parameters $\theta_k(\alpha) = (\alpha_k - 1)$ and natural sufficient statistics $t_k(\pi) = \log(\pi_k)$. Let
$c'(\theta) = c(\alpha_1(\theta), \ldots, \alpha_K(\theta))$; using a well known property of the exponential family distributions
(Schervish, 1995) it follows that

$$
\begin{aligned}
E_\theta[\, \log \pi_k \,] &= E_\theta[\, t_k(\pi) \,] \\
&= \frac{\partial}{\partial \theta_k} \log c'(\theta) \qquad \text{(Schervish, 1995, Thm 2.64)} \\
&= \psi(\theta_k + 1) - \psi\Big( \sum_k \theta_k + K \Big) \\
&= \psi(\alpha_k) - \psi\Big( \sum_k \alpha_k \Big),
\end{aligned}
$$

where $\psi(x)$ is the derivative of the log-gamma function, $d \log \Gamma(x) / dx$.
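The identity above can be checked numerically. The following sketch compares the closed form against a Monte Carlo average over Dirichlet draws; the pure-Python `digamma` approximation (recurrence plus an asymptotic series) and the parameter values are assumptions chosen for illustration, not something the original text specifies.

```python
import math
import random

def digamma(x):
    """psi(x) for x > 0: shift x upward by recurrence, then use an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def expected_log_dirichlet(alpha):
    """Closed form E[log pi_k] = psi(alpha_k) - psi(sum_k alpha_k)."""
    total = digamma(sum(alpha))
    return [digamma(a) - total for a in alpha]

def monte_carlo_log_dirichlet(alpha, n_draws, rng):
    """Average of log pi_k over Dirichlet draws built from normalized gamma variates."""
    sums = [0.0] * len(alpha)
    for _ in range(n_draws):
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        for k, value in enumerate(draws):
            sums[k] += math.log(value / total)
    return [s / n_draws for s in sums]
```

In the variational E step the same closed form is applied with the variational Dirichlet parameters in place of the prior parameters.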


Variational E Step  The approximate lower bound for the likelihood $\mathcal{L}_\Delta(q, \Theta)$ can be maximized
using exponential family arguments and coordinate ascent (Wainwright and Jordan, 2003).

Isolating terms containing $\phi^m_{p\to q,g}$ and $\phi^m_{p\leftarrow q,h}$ we obtain $\mathcal{L}_{\phi^m_{p\to q,g}}(q, \Theta)$ and $\mathcal{L}_{\phi^m_{p\leftarrow q,h}}(q, \Theta)$. The
natural parameters $g^m_{p\to q}$ and $g^m_{p\leftarrow q}$ corresponding to the natural sufficient statistics $\log(z^m_{p\to q})$ and
$\log(z^m_{p\leftarrow q})$ are functions of the other latent variables and the observations. We find that

$$
\begin{aligned}
g^m_{p\to q,g} &= \log \pi_{p,g} + \sum_h z^m_{p\leftarrow q,h} \cdot f\big( R_m(p,q), B(g,h) \big), \\
g^m_{p\leftarrow q,h} &= \log \pi_{q,h} + \sum_g z^m_{p\to q,g} \cdot f\big( R_m(p,q), B(g,h) \big),
\end{aligned}
$$

for all pairs of nodes (p, q) in the m-th network; where g, h = 1, ..., K, and

$$f\big( R_m(p,q), B(g,h) \big) = R_m(p,q) \log B(g,h) + \big( 1 - R_m(p,q) \big) \log \big( 1 - B(g,h) \big).$$


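This leads to the following updates for the variational parameters $(\phi^m_{p\to q}, \phi^m_{p\leftarrow q})$ for a pair of nodes
(p, q) in the m-th network:

$$
\begin{aligned}
\hat\phi^m_{p\to q,g} &\propto e^{E_q[ g^m_{p\to q,g} ]} \\
&= e^{E_q[ \log \pi_{p,g} ]} \cdot e^{\sum_h \phi^m_{p\leftarrow q,h} \cdot E_q[ f( R_m(p,q), B(g,h) ) ]} \\
&= e^{E_q[ \log \pi_{p,g} ]} \cdot \prod_h \Big( B(g,h)^{R_m(p,q)} \cdot \big( 1 - B(g,h) \big)^{1 - R_m(p,q)} \Big)^{\phi^m_{p\leftarrow q,h}}, \\
\hat\phi^m_{p\leftarrow q,h} &\propto e^{E_q[ g^m_{p\leftarrow q,h} ]} \\
&= e^{E_q[ \log \pi_{q,h} ]} \cdot e^{\sum_g \phi^m_{p\to q,g} \cdot E_q[ f( R_m(p,q), B(g,h) ) ]} \\
&= e^{E_q[ \log \pi_{q,h} ]} \cdot \prod_g \Big( B(g,h)^{R_m(p,q)} \cdot \big( 1 - B(g,h) \big)^{1 - R_m(p,q)} \Big)^{\phi^m_{p\to q,g}},
\end{aligned}
$$

for g, h = 1, ..., K. These estimates of the parameters underlying the distribution of the nodes'
group indicators $\phi^m_{p\to q}$ and $\phi^m_{p\leftarrow q}$ need to be normalized, to make sure $\sum_k \phi^m_{p\to q,k} = \sum_k \phi^m_{p\leftarrow q,k} = 1$.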
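The coupled updates for one pair of nodes can be sketched as a short fixed-point iteration. This is an illustrative sketch only: the function names, the uniform initialization, the number of sweeps, and the requirement that every entry of B lies strictly between 0 and 1 are assumptions, and the expected log memberships are passed in precomputed.

```python
import math

def log_f(r, b):
    """f(R, B(g,h)) = R log B(g,h) + (1 - R) log(1 - B(g,h)); assumes 0 < b < 1."""
    return r * math.log(b) + (1 - r) * math.log(1.0 - b)

def normalize_exp(log_weights):
    """Exponentiate and normalize, subtracting the maximum for numerical stability."""
    top = max(log_weights)
    weights = [math.exp(v - top) for v in log_weights]
    total = sum(weights)
    return [w / total for w in weights]

def update_phi_pair(r, elog_pi_p, elog_pi_q, B, n_sweeps=20):
    """Alternate the sender and receiver updates for one pair (p, q) with edge value r."""
    K = len(B)
    phi_recv = [1.0 / K] * K  # assumed uniform starting point
    phi_send = [1.0 / K] * K
    for _ in range(n_sweeps):
        phi_send = normalize_exp(
            [elog_pi_p[g] + sum(phi_recv[h] * log_f(r, B[g][h]) for h in range(K))
             for g in range(K)])
        phi_recv = normalize_exp(
            [elog_pi_q[h] + sum(phi_send[g] * log_f(r, B[g][h]) for g in range(K))
             for h in range(K)])
    return phi_send, phi_recv
```

Working with log weights and normalizing at the end is a common way to carry out the normalization step mentioned above without underflow.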

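Isolating terms containing $\gamma_{p,k}$ we obtain $\mathcal{L}_{\gamma_{p,k}}(q, \Theta)$. Setting $\partial \mathcal{L}_{\gamma_{p,k}} / \partial \gamma_{p,k}$ equal to zero and solving
for $\gamma_{p,k}$ yields:

$$\hat\gamma_{p,k} = \alpha_k + \sum_m \sum_q \phi^m_{p\to q,k} + \sum_m \sum_q \phi^m_{q\leftarrow p,k},$$

for all nodes $p \in \mathcal{N}$ and k = 1, ..., K. Here $\phi^m_{q\leftarrow p}$ is the receiver-side parameter of the pair (q, p), in which
node p acts as the receiver, consistent with $p_2(z^m_{p\leftarrow q} \mid \pi_q, 1)$ in Equation 3.1.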
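As an illustration of this update, the sketch below accumulates the variational indicator parameters into the Dirichlet parameters for a single network. It assumes that the receiver-side indicator of an ordered pair (p, q) is tied to node q, as in Equation 3.1; the dictionary-based data layout and the function name are hypothetical.

```python
def update_gamma(alpha, phi_send, phi_recv, n_nodes):
    """gamma[p][k] = alpha[k] + sender terms of pairs (p, .) + receiver terms of pairs (., p).

    phi_send[(p, q)] and phi_recv[(p, q)] are length-K lists for the ordered pair (p, q).
    """
    K = len(alpha)
    gamma = [list(alpha) for _ in range(n_nodes)]
    for (p, _q), phi in phi_send.items():
        for k in range(K):
            gamma[p][k] += phi[k]   # node p acts as sender in pair (p, q)
    for (_p, q), phi in phi_recv.items():
        for k in range(K):
            gamma[q][k] += phi[k]   # node q acts as receiver in pair (p, q)
    return gamma
```

With M networks the same accumulation would simply be repeated over the networks before returning.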

The t-th iteration of the variational E step is carried out for fixed values of $\Theta = \Theta^{(t-1)}$,
and finds the optimal approximate lower bound for the likelihood $\mathcal{L}_{\Delta^*}(q, \Theta^{(t-1)})$.


Variational M Step  The optimal lower bound $\mathcal{L}_{\Delta^*}(q^{(t-1)}, \Theta)$ provides a tractable surrogate for
the likelihood at the t-th iteration of the variational M step. We derive empirical Bayes estimates
for the hyper-parameters $\Theta$ that are based upon it.4 That is, we maximize $\mathcal{L}_{\Delta^*}(q^{(t-1)}, \Theta)$ with
respect to $\Theta$, given expected sufficient statistics computed using $\mathcal{L}_{\Delta^*}(q^{(t-1)}, \Theta^{(t-1)})$.

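Isolating terms containing $\alpha$ we obtain $\mathcal{L}_\alpha(q, \Theta)$. Unfortunately, a closed form solution for the
approximate maximum likelihood estimate of $\alpha$ does not exist (Blei et al., 2003). We can produce
a Newton-Raphson method that is linear in time, because the Hessian is the sum of a diagonal matrix
and a constant matrix. The gradient and Hessian for the bound $\mathcal{L}_\alpha$ are

$$
\begin{aligned}
\frac{\partial \mathcal{L}_\alpha}{\partial \alpha_k} &= N \Big( \psi\big(\textstyle\sum_k \alpha_k\big) - \psi(\alpha_k) \Big) + \sum_p \Big( \psi(\gamma_{p,k}) - \psi\big(\textstyle\sum_k \gamma_{p,k}\big) \Big), \\
\frac{\partial^2 \mathcal{L}_\alpha}{\partial \alpha_{k_1} \partial \alpha_{k_2}} &= N \Big( \psi'\big(\textstyle\sum_k \alpha_k\big) - I_{(k_1 = k_2)} \cdot \psi'(\alpha_{k_1}) \Big).
\end{aligned}
$$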

Isolating terms containing B we obtain $\mathcal{L}_B$, whose approximate maximum is

$$
\hat B(g,h) = \frac{ \sum_m \sum_{p,q} R_m(p,q) \cdot \phi^m_{p\to q,g}\, \phi^m_{p\leftarrow q,h} }{ (1 - \rho) \cdot \sum_m \sum_{p,q} \phi^m_{p\to q,g}\, \phi^m_{p\leftarrow q,h} },
$$

for every index pair $(g, h) \in [1, K] \times [1, K]$.
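The block-model update is a ratio of weighted edge counts to weighted pair counts, which the following sketch makes explicit for a single network. The dictionary layout, the function name, and the choice to return zero for blocks that receive no weight are assumptions made for illustration.

```python
def update_B(edges, phi_send, phi_recv, K, rho=0.0):
    """B[g][h] = sum R * phi_send_g * phi_recv_h / ((1 - rho) * sum phi_send_g * phi_recv_h).

    edges[(p, q)] is the observed 0/1 relation; phi_send and phi_recv map the same
    ordered pairs to length-K variational parameter vectors.
    """
    num = [[0.0] * K for _ in range(K)]
    den = [[0.0] * K for _ in range(K)]
    for pair, r in edges.items():
        for g in range(K):
            for h in range(K):
                weight = phi_send[pair][g] * phi_recv[pair][h]
                num[g][h] += r * weight
                den[g][h] += weight
    return [[num[g][h] / ((1.0 - rho) * den[g][h]) if den[g][h] > 0 else 0.0
             for h in range(K)] for g in range(K)]
```

Note that for large values of rho the ratio can exceed one; a practical implementation would have to decide how to handle that case, which the sketch deliberately leaves open.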

4We could term these estimates pseudo empirical Bayes estimates, since they maximize an approximate lower
bound for the likelihood, $\mathcal{L}_{\Delta^*}$.

In Section 3.1.2 we introduced an extra parameter, $\rho$, to control the relative importance of the presence
and absence of interactions in the likelihood, i.e., the score that informs inference and estimation.
Isolating terms containing $\rho$ we obtain $\mathcal{L}_\rho$. We may then estimate the sparsity parameter $\rho$ by

$$
\hat\rho = \frac{ \sum_m \sum_{p,q} \big( 1 - R_m(p,q) \big) \cdot \big( \sum_{g,h} \phi^m_{p\to q,g}\, \phi^m_{p\leftarrow q,h} \big) }{ \sum_m \sum_{p,q} \sum_{g,h} \phi^m_{p\to q,g}\, \phi^m_{p\leftarrow q,h} }.
$$

Alternatively, we can fix $\rho$ prior to the analysis; the density of the interaction matrix is estimated
with $\hat d = \sum_{m,p,q} R_m(p,q) / (N^2 M)$, and the sparsity parameter is set to $\tilde\rho = (1 - \hat d)$. This latter
estimator attributes all the information in the non-interactions to the point mass, i.e., to latent
sources other than the block model B or the mixed membership vectors $\pi_{1:N}$. It does however
provide a quick recipe to reduce the computational burden during exploratory analyses.5
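The quick recipe of fixing the sparsity parameter from the observed density amounts to a single pass over the adjacency matrices. The sketch below assumes each network is given as an N by N list of 0/1 rows; the function name and the toy data are hypothetical.

```python
def sparsity_from_density(networks):
    """Return (d, rho) with d = sum of all entries / (N^2 * M) and rho = 1 - d."""
    n_networks = len(networks)
    n_nodes = len(networks[0])
    total = sum(sum(sum(row) for row in network) for network in networks)
    density = total / (n_nodes * n_nodes * n_networks)
    return density, 1.0 - density

# Two toy 2-node networks with three observed interactions in total.
toy = [[[0, 1], [0, 0]], [[0, 1], [1, 0]]]
density, rho = sparsity_from_density(toy)
```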


Smoothing  In problems where the number of clusters is deemed likely to be large a priori, we
can smooth the (consequently large number of) cluster-to-cluster relation probabilities encoded in
the block model B by positing that all the elements B(g, h) of the block model are non-observable
samples from a common (prior) distribution. In the admixture of latent blocks model we posit
that $p(B \mid \lambda)$ is a collection of non-symmetric beta distributions, with a pair of hyper-parameters $\lambda$
common to all elements of B.


Example 17 (Continued)  Sampson (1968) surveyed 18 novice monks in a monastery and asked
them to rank the other novices in terms of four sociometric relations: like/dislike, esteem, personal
influence, and alignment with the monastic credo. Sampson's original analysis strongly suggests
the existence of tight factions among the novices, and the events that took place during his stay at
the monastery support his observations; briefly, novices of one faction left the monastery or were
expelled over religious differences. The factions identified by Sampson provide a credible gold
standard, to which the results are compared.

5Note that the two estimates of $\rho$ coincide in the case of single membership. In fact, that implies $\phi^m_{p\to q,g} = \phi^m_{p\leftarrow q,h} = 1$ for some (g, h) pair,
for any (p, q) pair.

I consider Breiger's collation of Sampson's data (Breiger et al., 1975). Briefly, for each of
the four sociometric relations above, only the top three choices of each novice were recorded
as positive relations, that is, the edges in the graph. The union of all positive relations, disregarding
multiplicity as in Handcock et al. (2007), is the starting point of our analysis. To assess model fit,
I use an approximation to BIC:

$$\mathrm{BIC} = 2 \cdot \log p(R) \approx 2 \cdot \log p(R \mid \hat\pi, \hat Z, \hat\alpha, \hat B) - |\alpha, B| \cdot \log |R|,$$


where $|\alpha, B|$ is the number of hyper-parameters in the model, and $|R|$ is the number of positive
relations observed, following arguments in Handcock et al. (2007). The approximate BIC value
suggests that the relations among monks in the monastery studied by Sampson are best explained
by a model with three factions, independently of the number of hyper-parameters in the fitted ALB
models. In the left panel of Figure 3.2 I show the approximate BIC for a model with a single
hyper-parameter, a scalar $\alpha$. Hence I fixed K = 3 in subsequent analyses, which involved ALB models
with increasing degree of complexity. The right panel of Figure 3.2 shows the estimated
faction-to-faction block model, B, corresponding to a full model (i.e., no constraints on B). This estimate
suggests that the Outcasts are an isolated faction, whereas the Young Turks like members of the Loyal
Opposition, although the sentiment is not reciprocated. Figure 3.3 shows the posterior
means of the mixed membership scores, $E[\pi \mid R]$, for the 18 monks in the monastery ($\hat\alpha = 0.058$
scalar, $B := I_3$). There is a panel for each monk, and the subscripts associated with the names of
the monks specify the order according to which they left the monastery, e.g., John left first. The
three factions on the X axis are the Outcasts, the Young Turks, and the Loyal Opposition (from
left to right); the Y axis measures the degree of membership of monks to factions. These panels
bring out clearly the central role played by John and Greg, the first to leave the monastery, as
well as the uncertain affiliations of Romul and, to a lesser extent, Victor.

[Figure 3.2 panels. Left: approximate BIC against the number of clusters. Right: block model over the Outcasts, Loyal Opposition, and Young Turks.]

Figure 3.2: The approximate BIC (left panel) suggests the relations among monks are best
explained by a model with three factions. The faction-to-faction estimated relational patterns (right
panel) suggest that the Outcasts are an isolated faction, whereas the Young Turks like members of the
Loyal Opposition, although the sentiment is not reciprocated.

Figure 3.3: The posterior mixed membership scores, $\pi$, for the 18 monks in the monastery. Each
panel corresponds to a monk; the Y axis measures the grade of membership, corresponding to the
Outcasts (left bar), to the Young Turks (center bar), and to the Loyal Opposition (right bar), on the
X axis. The subscripts associated with the names of the monks specify the order according to
which they left the monastery.

Figure 3.4: Original matrix of sociometric relations (left), and estimated relations obtained by
thresholding the posterior expectations $\pi_p^\top B\, \pi_q \mid R$ (center), and $\phi_p^\top B\, \phi_q \mid R$ (right).

Figure 3.5: Mixed membership vectors, $\pi_{1:18}$, plotted in the reference simplex. Marks correspond
to individual monks; the red circle marks correspond to an ALB model with ($B := I_3$, $\alpha = 0.01$),
whereas the blue triangle marks correspond to an ALB model with ($B := I_3$, $\hat\alpha = 0.058$); where
$I_K$ is the K-dimensional identity matrix.

The mixed membership vectors, $\pi_{1:18}$, provide us with low-dimensional representations of the monks.
Figure 3.5 plots them in their natural space, that is, the (3-dimensional) simplex. Marks correspond
to monks; the red circles were obtained by fixing $B := I_3$ and $\alpha = 0.01$, whereas the blue triangles
correspond to fixing $B := I_3$, but estimating $\hat\alpha = 0.058$. To compare the latent representation of
the monks obtained with ALB with the one presented in Handcock et al. (2007, Table 1), I mapped
the contour levels of their estimated mixture of three Gaussians (Handcock et al., 2007, Table
1) into the reference simplex, using the following transformation,


T  = 


-0.5  0.5  0 

^  0.5 


The  contour  levels  of  such  density  in  Figure  3.5  suggest  that  our  model  and  the  latent  space  mixture 
model  lead  to  different  structures  and  somewhat  different  interpretations. 


Example 18 (Continued)  The goal of the analysis here is to study proteins' diverse functional
roles by analyzing their local and global patterns of interaction. The biochemical composition of
individual proteins makes them suitable for carrying out a specific set of cellular operations, or
functions. Proteins typically carry out these functions as part of stable protein complexes (Krogan et al.,
2006). There are many situations in which proteins are believed to interact (Alberts et al., 2002);
the main intuition behind our methodology is that pairs of proteins interact because they are part of
the same stable protein complex, i.e., co-location, or because they are part of interacting protein
complexes as they carry out compatible cellular operations.

The Munich Information Center for Protein Sequences (MIPS) database was created in 1998 based on
evidence derived from a variety of experimental techniques, but does not include information from
high-throughput data sets (Mewes et al., 2004). It contains about 8000 protein complex
associations in yeast. We analyze a subset of this collection containing 871 proteins, the interactions
amongst which were hand-curated. The institute also provides a set of functional annotations,
alternative to the gene ontology (GO). These annotations are organized in a tree, with 15 general
functions at the first level, 72 more specific functions at an intermediate level, and 255 annotations
at the leaf level. In Table 3.1 we map the 871 proteins in our collection to the main functions of
the MIPS annotation tree; proteins in our sub-collection have about 2.4 functional annotations on
average.6 By mapping proteins to the 15 general functions, we obtain a 15-dimensional
representation for each protein. In Figure 3.6 each panel corresponds to a protein; the 15 functional categories
are ordered as in Table 3.1 on the X axis, whereas the presence or absence of the corresponding
functional annotation is displayed on the Y axis.

Table 3.1: General functional categories in the MIPS tree, and their relative popularity. In the
table we report the number of proteins that have at least one functional annotation in the general
categories in the left column. Counts refer to the subset of 871 proteins in yeast, which are part of
the hand-curated MIPS interaction network.

  #   Category                             Size
  1   Metabolism                            125
  2   Energy                                 56
  3   Cell cycle & DNA processing           162
  4   Transcription (tRNA)                  258
  5   Protein synthesis                     220
  6   Protein fate                          170
  7   Cellular transportation               122
  8   Cell rescue, defence & virulence        6
  9   Interaction w/ cell, environment       18
 10   Cellular regulation                    37
 11   Cellular other                         78
 12   Control of cell organization           36
 13   Sub-cellular activities               789
 14   Protein regulators                      1
 15   Transport facilitation                 41

6We note that the relative importance of functional categories in our sub-collection, in terms of the number of
proteins involved, is different from the relative importance of functional categories over the entire MIPS collection.

Figure 3.6: By cutting the MIPS annotation tree at the first level we find the 15 general
functional categories in Table 3.1. By mapping proteins to the 15 general functions, we obtain a
15-dimensional representation for each protein. In the Figure, each panel corresponds to a protein;
the 15 functional categories are displayed on the X axis, whereas the presence or absence of the
corresponding functional annotation is displayed on the Y axis. The plots at the bottom zoom into
the panels corresponding to three example proteins.

Protein-protein interactions (PPI) form the physical basis for the formation of complexes and
pathways which carry out different biological processes. A number of high-throughput experimental
approaches have been applied to determine the set of interacting proteins on a proteome-wide scale
in yeast. These include the two-hybrid (Y2H) screens and mass spectrometry methods. For
example, mass spectrometry is used to identify components of protein complexes (Gavin et al., 2002;
Ho et al., 2002). High-throughput methods, though, may miss complexes that are not present under
the given conditions. For example, tagging may disturb complex formation, and weakly associated
components may dissociate and escape detection. Statistical models that encode information about
functional processes with high precision are then an essential tool for carrying out probabilistic
de-noising of biological signals from high-throughput experiments.

Figure 3.7: Predicted mixed-membership probabilities (dashed, red lines) versus binary manually
curated functional annotations (solid, black lines) for 6 example proteins. The identification of
latent groups to functions is estimated, and it is discussed in Figure 3.8.

In previous work, we established the usefulness of the admixture of latent blocks model for
analyzing protein-protein interaction data. For example, we used the ALB for testing functional
interaction hypotheses and for semi-supervised and unsupervised estimation experiments (Airoldi et al.,
2005b). We then attempted to assess whether, and how much, functionally relevant biological
signal can be captured by the ALB model (Airoldi et al., 2005a). In summary, our findings show
that ALB identifies protein complexes whose member proteins are tightly interacting with one
another. The identifiable protein complexes correlate with the following four categories of Table 3.1:
cell cycle & DNA processing, transcription, protein synthesis, and sub-cellular activities. The high
correlation of inferred protein complexes can be leveraged for predicting the presence or absence
of functional annotations, for example, by using a logistic regression. However, there is not enough
signal in the data to independently predict annotations in other functional categories. The
empirical Bayes estimates of the hyper-parameters that support these conclusions in the various types of
analyses are consistent: $\alpha < 1$ and small, and B nearly block diagonal with two positive blocks
comprising the four identifiable protein complexes. These previous analyses fixed the number of
latent protein complexes to 15. Figure 3.7 displays a few examples of predicted mixed membership
probabilities against the true annotations, given an estimated mapping of latent protein complexes
to functional categories. The latent protein complexes are not a priori identifiable in our model.
To resolve this, we find a mapping between latent complexes and functions by minimizing the
divergence between true7 and predicted marginal frequencies of membership. We used this mapping
to compare predicted versus known functional annotations, for all proteins. The best estimated
mapping is shown in Figure 3.8.
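The idea of matching latent groups to functional categories by comparing marginal frequencies can be illustrated with a toy sketch. The text does not say which divergence or search procedure was used, so the Kullback-Leibler divergence, the exhaustive search over permutations (feasible only for a handful of groups), the assumption that both frequency vectors are normalized, and all names below are assumptions.

```python
import math
from itertools import permutations

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions, with a small constant for stability."""
    return sum(pk * math.log((pk + eps) / (qk + eps)) for pk, qk in zip(p, q) if pk > 0)

def best_mapping(true_freq, predicted_freq):
    """Return perm such that latent group perm[k] is matched to functional category k."""
    K = len(true_freq)
    return min(permutations(range(K)),
               key=lambda perm: kl_divergence(true_freq, [predicted_freq[j] for j in perm]))

# Hypothetical marginal frequencies for three categories and three latent groups.
mapping = best_mapping([0.5, 0.3, 0.2], [0.2, 0.5, 0.3])
```

For the 15 categories used in the analysis an exhaustive search is not practical, so an assignment algorithm or a greedy heuristic would be needed instead.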

7Evaluated on a small fraction of the interactions.

Figure 3.8: We estimate the mapping of latent groups to functions. The two plots show the marginal
frequencies of membership of proteins to true functions (bottom) and to identified functions (top),
in the cross-validation experiment. The mapping is selected to maximize the accuracy of the
predictions on the training set, in the cross-validation experiment, and to minimize the divergence
between marginal true and predicted frequencies if no training data is available (see the text).

[Figure 3.9 legend. Affinity precipitation: Gavin et al.; Ho et al. Synthetic lethality: Tong et al. Two hybrid: Uetz et al.; Ito et al. (2000); Ito et al. (2001); Tong et al.; Fromont-Racine et al.; Drees et al. Microarray expression: Gasch et al. (2001); Gasch et al. (2000); Spellman et al. MIPS (871): Mewes et al. (2004). MIPS via ALB (K=50, Zs), marked +, and MIPS via ALB (K=50, Pis), marked x: Airoldi et al. (2006). Dashed line: Random.]

Figure 3.9: In the top panel we measure the functional content of the MIPS collection of protein
interactions (yellow diamond), and compare it against other published collections of interactions
and microarray data, and to the posterior estimates of ALB models, computed as described in
the text. A breakdown of three estimated interaction networks (the numbered points) into most
represented gene ontology categories is detailed in Table 3.2.

Following up on the hypothesis that the size of a stable protein complex in yeast is about 5
proteins on average, and skewed towards bigger complexes (Krogan et al., 2006), we explored
a richer space of models with K = 50, ..., 225. However, using approximate BIC to assess
model fit (Handcock et al., 2007) we found that the more parsimonious models (K = 50) provide
a better description of the observed interactions. This fact is consistent with previous findings
(Airoldi et al., 2005b), and suggests that the interactions in the MIPS collection alone encode a
biological signal at a higher aggregation level than that of specific complexes. In order to explore
this hypothesis we considered an alternative annotation scheme to that of the Munich Information Center for
Protein Sequences; namely the Saccharomyces cerevisiae gene database and gene ontology (GO)
(Ashburner et al., 2000). Based on the GO, Myers et al. (2006) recently proposed a solid
framework to assess the functional content of biological data. Making use of it, we measure the
functional content in the interactions encoded in an ALB model with K = 50, fitted using the nested
variational EM algorithm detailed in the Appendix. In Figure 3.9, we measure the functional
content in the posterior means,

$$E[\, R(p,q) = 1 \,] \approx \hat\pi_p^\top \hat B\, \hat\pi_q \quad \text{and} \quad E[\, R(p,q) = 1 \,] \approx \hat\phi_{p\to q}^\top \hat B\, \hat\phi_{p\leftarrow q},$$

where positive interactions are obtained by thresholding the expectations. Figure 3.9 shows the
original MIPS collection as one of the most precise (Y axis) and most extensive (X axis) sources
of biologically relevant interactions available to date. The posterior means of $(\pi_{1:N})$ and the
estimates of $(\alpha, B)$ provide a parsimonious representation for the MIPS collection, and lead to precise
interaction estimates, although in moderate amount (the light blue line marked with x). The posterior means
of $(Z^{\to}, Z^{\leftarrow})$ do not provide a parsimonious representation for the data, but describe most of the
functional content of the MIPS collection with high precision (the dark blue line marked with +). A
breakdown of three example interaction networks displayed in Figure 3.9 into most represented gene
ontology categories is detailed in Table 3.2. We investigate the correlations between data
collections (rows) and a sample of gene ontology categories (columns). The intensity of the square (red
is high) measures the area under the precision-recall curve. For more detail about these plots see
Figures 5-6 in Myers et al. (2006).


*  *  * 

When applied to a sample of measurements on pairs of objects, Admixture of Latent Blocks simultaneously extracts information about (i) the mixed membership of objects to latent aspects, and (ii) the connectivity patterns among latent aspects, using a nested variational EM algorithm. I found it useful for revealing group membership in social networks, as well as for describing and summarizing the functional content of a protein interaction network, and I envision its use for de-noising new collections of interactions from high-throughput experiments.

Table 3.2: Breakdown of three example interaction networks into most represented gene ontology categories. The digit in the first column refers to the numbered points in Figure 3.9. The last two columns quote the number of predicted, and possible pairs for each GO term.

#   GO Term      Description                                      Pred.     Tot.
1   GO:0043285   Biopolymer catabolism                              561    17020
1   GO:0006366   Transcription from RNA polymerase II promoter      341    36046
1   GO:0006412   Protein biosynthesis                               281   299925
1   GO:0006260   DNA replication                                    196     5253
1   GO:0006461   Protein complex assembly                           191    11175
1   GO:0016568   Chromatin modification                             172    15400
1   GO:0006473   Protein amino acid acetylation                      91      666
1   GO:0006360   Transcription from RNA polymerase I promoter        78      378
1   GO:0042592   Homeostasis                                         78     5778
2   GO:0043285   Biopolymer catabolism                              631    17020
2   GO:0006366   Transcription from RNA polymerase II promoter      414    36046
2   GO:0016568   Chromatin modification                             229    15400
2   GO:0006260   DNA replication                                    226     5253
2   GO:0006412   Protein biosynthesis                               225   299925
2   GO:0045045   Secretory pathway                                  151    18915
2   GO:0006793   Phosphorus metabolism                              134    17391
2   GO:0048193   Golgi vesicle transport                            128     9180
2   GO:0006352   Transcription initiation                           121     1540
3   GO:0006412   Protein biosynthesis                               277   299925
3   GO:0006461   Protein complex assembly                           190    11175
3   GO:0009889   Regulation of biosynthesis                          28      990
3   GO:0051246   Regulation of protein metabolism                    28      903
3   GO:0007046   Ribosome biogenesis                                 10    21528
3   GO:0006512   Ubiquitin cycle                                      3     2211

A recurring question, which bears relevance to mixed membership models in general, is why one does not necessarily want to integrate out the single membership indicators, $(z_{p \to q}, z_{p \leftarrow q})$ in the specifications above. There are some computational aspects to this, but a practical issue that argues against such marginalization is that we would often lose interpretable quantities that are useful for making predictions, for de-noising new measurements, or for performing other tasks. In fact, the posterior distributions of such quantities typically carry substantive information about elements of the application at hand. In the application to protein interaction networks, for example, they encode the interaction-specific memberships of individual proteins to protein complexes.

Figure 3.10: We investigate the correlations between data collections (rows) and a sample of gene ontology categories (columns). The intensity of the square (red is high) measures the area under the precision-recall curve.

There is a tight relationship between ALB and the latent space models in Hoff et al. (2002); Handcock et al. (2007). In the latent space models, the latent vectors are drawn from Gaussian distributions and the interaction data is drawn from a Gaussian with mean $\pi_p^\top I \pi_q$. In ALB, the marginal probability of an interaction takes a similar form, $\pi_p^\top B \pi_q$, where $B$ is the matrix of probabilities of interactions for each pair of latent factions. In contrast to the latent space model, ALB allows the relations to be modeled by an arbitrary distribution. With binary relations a collection of Bernoulli parameters can be used; with continuous relations, a collection of Gaussian parameters can be used. While more flexible, ALB does not subsume latent space models; they make different assumptions about the data. See Handcock et al. (2007) with discussion (Blei and Fienberg, 2007; Airoldi, 2007) for more details.
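The bilinear form shared by the two model families can be illustrated numerically. The sketch below is only an illustration: the function name `marginal_interaction_prob`, the membership vectors, and the entries of `B` are hypothetical values chosen for the example, not quantities estimated in this thesis. Replacing `B` with the identity matrix gives the plain inner product that plays the role of the mean in the latent space model.

```python
def marginal_interaction_prob(pi_p, pi_q, B):
    """Return pi_p' B pi_q, the marginal probability that p interacts with q."""
    return sum(
        pi_p[g] * B[g][h] * pi_q[h]
        for g in range(len(pi_p))
        for h in range(len(pi_q))
    )


# Hypothetical example with K = 2 latent factions (all numbers are made up).
pi_p = [0.7, 0.3]            # mixed membership of object p
pi_q = [0.2, 0.8]            # mixed membership of object q
B = [[0.90, 0.05],           # block-level Bernoulli interaction probabilities
     [0.05, 0.60]]
identity = [[1.0, 0.0],
            [0.0, 1.0]]

alb_prob = marginal_interaction_prob(pi_p, pi_q, B)
inner_product = marginal_interaction_prob(pi_p, pi_q, identity)
print(alb_prob, inner_product)
```

With a dense `B`, interactions between different factions contribute to the marginal probability; with the identity they do not, which is one way to see that the two families make different assumptions.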


3.2  Local  Diffusion  Potentials 

Here I briefly situate in the context of this thesis some recent developments in mathematics that bear relevance to statistical network analysis. I wish to thank Ann B. Lee for carrying out the calculations and for generous advice on the material presented here.

The main linkage between the mathematics of diffusion (Lafon and Lee, 2006) and statistical network analysis is a notion of distance that measures the connectivity between nodes through a multiple-step multiple-path diffusion process on the graph. This formulation depends on non-observable, node-specific quantities, which I term local diffusion potential. Higher-order connectivity patterns could be naturally incorporated into statistical models of graphs and networks in diffusion space (Coifman et al., 2005a,b).

Example 19. Consider the diffusion of innovation among physicians studied by Coleman et al. (1957). Doctor A suggests that doctor B try a new drug, and doctor B later suggests its use to doctor C. In a sense, the influence of doctor A indirectly extends to doctor C. Such mediated connections may occur through multiple steps and multiple paths. Defining a distance metric that explicitly encodes this enriched notion of connectivity, that is, the diffusion potential specific to a doctor (to a node in a graph), is the focus of this section.

3.2.1  Goals  of  the  Analysis 

An abstract framework to study diffusion has been introduced by Coifman et al. (2005a,b) in the context of high-dimensional data analysis and manifold learning. In such a framework nodes in a graph are represented in terms of their multivariate attributes, $x_n \in \mathbb{R}^p$ for each $n \in \mathcal{N}$. A kernel function $f\left(\|x_n - y\| / h\right)$, with a certain bandwidth $h$, defines the local neighborhood of $x_n$, and is used to compute weights of the edges to be imputed. This procedure results in a (fully connected) graph among the nodes. Within this framework, a notion of distance has been defined that controls the influence of each node on its neighbors, through a multiple-step multiple-path diffusion process on the graph, at a global scale (Lafon et al., 2006).

In this thesis, by contrast, the graph is given. It is nonetheless possible to measure distance between nodes through a multiple-step multiple-path diffusion process on the graph by defining local diffusion potentials in the form of local scales, $t_n$, specific to nodes $n \in \mathcal{N}$. Distances in the graph, based upon local diffusion potentials, correspond to Euclidean distances on the lower dimensional manifold implicitly defined by the graph.8

3.2.2  Technical  Preliminaries 

Consider a connected graph $G = (V, E)$, where $V$ is a set of $N$ vertices, and $E$ is a set of undirected edges. Edges are mapped to weights $w_{ij}$, for $i, j \in V$. The weights $W = \{w_{ij}\}$ satisfy the following conditions: (i) symmetry, $W = W^\top$, (ii) pointwise positivity, $w_{ij} \geq 0$ for $i, j \in V$ and $w_{ii} > 0$, and (iii) positive semi-definiteness. These conditions can be relaxed, and the methodology extended, to cover the more general case of directed graphs.

Consider the spectral properties of the Markov chain on $W$. The transition matrix $P$ has a set of left and right eigenvectors according to:

$$\phi_k^\top P = \lambda_k \phi_k^\top \quad \text{and} \quad P \psi_k = \lambda_k \psi_k \qquad (3.5)$$

where the eigenvalues satisfy $\lambda_0 = 1 \geq \lambda_1 \geq \ldots \geq \lambda_{N-1} \geq 0$. Furthermore, the left and right eigenvectors satisfy the biorthogonality relation $\phi_k^\top \psi_l = \delta_{kl}$, where $\delta_{kl}$ is Kronecker's delta. For convenience, the eigenvectors are normalized with respect to $1/\phi_0$ and $\phi_0$, respectively, so that:

$$\|\phi_k\|^2_{1/\phi_0} = 1 \quad \text{and} \quad \|\psi_k\|^2_{\phi_0} = 1. \qquad (3.6)$$

8 Explicitly defined by $x_{1:N} \in \mathbb{R}^p$ in the formulation of Coifman et al. (2005a,b).


It can be verified that $\lambda_0 = 1$, $\psi_0 = 1$, and that $\phi_0$ is defined as in Eq. 3.11. Furthermore, it follows that

$$\psi_k(i) = \frac{\phi_k(i)}{\phi_0(i)} \qquad (3.7)$$

for $k = 0, 1, \ldots, N-1$ and $i \in V$. Rewrite the transition probabilities $p_t(i,j)$ and the diffusion metric in terms of these eigenvectors and eigenvalues. By inserting the biorthogonal spectral decomposition

$$p_t(i,j) = \sum_{k \geq 0} \lambda_k^t \, \psi_k(i) \, \phi_k(j) \qquad (3.8)$$

into Eq. 3.10, and using orthonormality $\sum_j \phi_k(j) \phi_l(j) / \phi_0(j) = \delta_{kl}$, it follows

$$\mathcal{D}^2(n, m; t_n, t_m) = \sum_{k \geq 0} \left(\lambda_k^{t_n} \psi_k(n) - \lambda_k^{t_m} \psi_k(m)\right)^2 \approx \sum_{k=1}^{K} \left(\lambda_k^{t_n} \psi_k(n) - \lambda_k^{t_m} \psi_k(m)\right)^2. \qquad (3.9)$$

Note that the $k = 0$ term does not appear in the sum as $\lambda_0 = 1$ and $\psi_0 = 1$.

3.2.3 The Main Result

The  calculations  above  lead  to  a  generalized  diffusion  distance  between  nodes  n  and  m  according 
to 


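$$\mathcal{D}^2(n, m; t_n, t_m) = \sum_{j \in V} \frac{\left(p_{t_n}(n,j) - p_{t_m}(m,j)\right)^2}{\phi_0(j)} \qquad (3.10)$$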


where  the  scale  parameters  tn  and  tm  determine  the  local  influence  of  nodes  n  and  m  on  their 
neighbors,  and  the  function 


$$\phi_0(j) = \frac{\sum_{i \in V} w_{ij}}{\sum_{i, l \in V} w_{il}} \qquad (3.11)$$


is the stationary distribution of the Markov chain, i.e., $\lim_{t \to +\infty} p_t(i,j) = \phi_0(j)$. According to this metric, nodes $n$ and $m$ will be close if they interact with the same nodes in the graph.9 For an undirected graph, such a situation occurs when there are many paths connecting the two nodes.

The  main  points  are  the  following: 

• From Eq. 3.9, it is clear that the diffusion metric is a distance on the graph induced by a one-parameter family of eigenmaps

$$\Psi_t : n \mapsto \left(\lambda_1^t \psi_1(n), \; \lambda_2^t \psi_2(n), \; \ldots, \; \lambda_K^t \psi_K(n)\right)^\top \qquad (3.12)$$

for $n \in V$.

• Effectively, we only need to keep the first $K$ terms, where $K \ll N$, in the identity for $\mathcal{D}$. The accuracy of the approximation depends on the value of $K$, the speed of the decay of the eigenvalues $1 \geq \lambda_i \geq 0$ (for $i = 1, 2, \ldots, N-1$), and the exponent $t$. For a fixed accuracy, a larger $t$ implies fewer terms in the sum.

In other words, Eq. 3.10 can be expressed as a Euclidean distance

$$\mathcal{D}^2(n, m; t_n, t_m) \approx \left\|\Psi_{t_n}(n) - \Psi_{t_m}(m)\right\|^2 \qquad (3.13)$$

in a low-dimensional "diffusion space". The coordinates of nodes $n$ and $m$ in this space are given by a diffusion map at time scales $t_n$ and $t_m$, respectively.

9 The weights $1/\phi_0(j)$ penalize discrepancies on nodes of lower degree more than differences on neighboring nodes of higher degree.
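To make the role of the local scales concrete, the following sketch evaluates a diffusion distance on a small graph directly from powers of the transition matrix, rather than through the eigen-decomposition. It rests on assumptions that are not spelled out in the text: the transition matrix is taken to be the row-normalized weight matrix, the stationary distribution is taken to be proportional to the node degrees, the distance is the usual $1/\phi_0$-weighted squared difference between rows of transition probabilities, and the scales are integers. All function names and the example weight matrix are hypothetical.

```python
def transition_matrix(W):
    """Row-normalize nonnegative symmetric weights W into a Markov matrix P."""
    return [[w / sum(row) for w in row] for row in W]


def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(P, t):
    """P raised to a nonnegative integer power t (t = 0 gives the identity)."""
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, P)
    return result


def stationary(W):
    """Degree-proportional stationary distribution of the walk on W."""
    total = sum(sum(row) for row in W)
    return [sum(row) / total for row in W]


def diffusion_distance_sq(W, n, m, t_n, t_m):
    """Squared distance between nodes n and m at local scales t_n and t_m."""
    P = transition_matrix(W)
    phi0 = stationary(W)
    row_n = mat_pow(P, t_n)[n]
    row_m = mat_pow(P, t_m)[m]
    return sum((a - b) ** 2 / f for a, b, f in zip(row_n, row_m, phi0))


# Hypothetical path graph on four nodes with self-weights on the diagonal.
W = [[2, 1, 0, 0],
     [1, 2, 1, 0],
     [0, 1, 2, 1],
     [0, 0, 1, 2]]

near = diffusion_distance_sq(W, 0, 1, 2, 2)   # adjacent nodes
far = diffusion_distance_sq(W, 0, 3, 2, 2)    # opposite ends of the path
print(near, far)
```

In this example the two end nodes are farther apart than two adjacent nodes at the same scale, and all distances shrink toward zero as both scales grow, since every row of the powered transition matrix approaches the stationary distribution.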

*  *  * 

In  this  chapter,  I  introduced  stochastic  block  models  of  mixed  membership,  which  extend  block 
models  (Holland  and  Leinhardt,  1975)  to  include  mixed-membership  in  a  hierarchical  Bayesian 
framework.  I  presented  summaries  of  two  successful  applications  of  such  models  in  the  context 
of  social  and  protein  interaction  networks  (Airoldi  et  al.,  2006c, d,  2007b).  I  discussed  similarities 
and  differences  between  stochastic  block  models  of  mixed  membership  and  latent  space  models 
(Hoff  et  al.,  2002;  Handcock  et  al.,  2007;  Airoldi,  2007;  Blei  and  Fienberg,  2007).  I  concluded  by 
situating in the context of this thesis some recent developments in the mathematics of diffusion that bear relevance to the proposed methodology for statistical network analysis.

In the previous chapter, I developed generative models of networks where the identity of each node was the only attribute that was observed. In this chapter, I develop Bayesian mixed-membership models of objects' attributes. I then develop models where such objects are the nodes of a graph. The resulting integrated models can accommodate measurements on relations and attributes involving objects of different types, along with the corresponding sets of latent variables, in a hierarchical Bayesian framework. I describe a multivariate generalization of models of attributes and relations that is amenable to theoretical analysis (to be pursued in future work). This modeling effort informs a discussion of alternative strategies for integrating complex data. Two flavors of integration strategies emerge that are best suited to support descriptive and predictive analyses.


4.1  Heavy-Tailed  Attributes 

In this section, I develop statistical models for estimating latent patterns from attribute data with a heavy-tailed distribution. The notion of contagion, i.e., the dependence among multiple occurrences of the same attribute, is introduced to express variability profiles induced by heavy tails. Furthermore, contagion is a convenient analytical formalism to characterize semantic themes such as biological context. Model variants tailored to different properties of the data are explored, and a general scheme for approximate posterior inference is presented, which is based on variational methods.

Example 20. A fundamental problem in the serial analysis of gene expression (SAGE) data is that of identifying temporal patterns of gene expression (i.e., latent distributions over a predetermined sequence of epochs) that can help explain a biological process from a large pool of observed temporal gene expression profiles (Blackshaw et al., 2004; Cai et al., 2004). The set of latent expression patterns can then be used for suggesting hypotheses and further analyses, or for making predictions.

Example 21. A recent problem in text and natural language processing is that of identifying topics, i.e., latent distributions over words in the vocabulary, that best explain a collection of documents (Minka and Lafferty, 2002; Blei et al., 2003; Erosheva et al., 2004; Blei and Lafferty, 2006). The set of topics provides a low-dimensional representation of each document and can be used for organizing and browsing the collection of documents efficiently.

The description of the methodology in this section exploits the intuition developed in the biological context of Example 20.

From a methodological perspective, the task of identifying latent temporal patterns is essentially an allocation problem; observed gene expression profiles need to be allocated to latent temporal patterns. The goal is to make inference on: (i) the number of latent patterns, (ii) a numerical description of the patterns themselves, and (iii) the mixed membership of the observed gene expression profiles to latent patterns. This is an instance of the more general problem of allocating observed sequences, i.e., longitudinal representations of objects in terms of an attribute, to latent sequential patterns, where each observation is allowed to be the measurable manifestation of more than one pattern. In the context of serial analysis of gene expression (SAGE), Cai et al. (2004) introduce a variant of the K-means algorithm that minimizes a non-standard scoring function, which combines the Chi-square statistic (to measure the strength of co-expression) with the Poisson distribution (to measure the likelihood of the expression level of genes at each epoch). Approaches based on clustering methods, however, constrain the expression level of a gene at each epoch to follow the expression profile typical of a single pattern. In other words, such approaches entail unique membership of observations to patterns, rather than mixed membership.

Models of mixed membership have been successfully applied in the context of different problems (e.g., Pritchard et al., 2000; Rosenberg et al., 2002; Xing et al., 2003a; Minka and Lafferty, 2002; Blei et al., 2003; Griffiths and Steyvers, 2004; Buntine and Jakulin, 2004; Blei and Lafferty, 2006). Such models, however, fall short of accommodating the marginal variability profiles of observed attributes, jeopardizing the accuracy and the interpretability of the inferences. Existing models appear to be unsuitable for the biological application to the SAGE data in part because of the assumption of independence, as discussed in Section 4.1.1. Below, I shall refer to popular models based on such an assumption as independence models.


4.1.1  The  Data  and  Goals  of  the  Analysis 

Serial analysis of gene expression (SAGE) is a technology that quantitatively measures the copy numbers of mRNA transcripts, simultaneously for a large number of genes in a biological sample, such as a cell population or a tissue (Velculescu et al., 1995). This technology is used to aid the discovery of gene expression profiles that characterize functional processes of interest, and to compare and catalog new genes.

A SAGE experiment begins by sampling a total of $B$ transcripts at random from a biological sample under some specific condition (e.g., a cell cycle stage), and then uses $N$ gene-specific tags to probe the existence of possible genes in each of the $B$ transcripts. Let $X_b = (X_{b1}, X_{b2}, \ldots, X_{bN})^\top$, such that $X_{bn} \in \{0, 1\}$ and $\sum_{n=1}^{N} X_{bn} = 1$, be a unit-basis indicator vector recording the probing results for transcript $b$ (i.e., $X_{bn} = 1$ indicates that gene $n$ is present on transcript $b$). The number of mRNA copies of a gene $n$, denoted by $Y_n$, and the vector of copy counts for all genes (i.e., an expression profile), $Y = (Y_1, Y_2, \ldots, Y_N)^\top$, can then be simply expressed as:

$$Y_n = \sum_{b=1}^{B} X_{bn}, \qquad Y = \sum_{b=1}^{B} X_b. \qquad (4.1)$$

Note that the $Y_n$'s are each binomially distributed, controlled by gene-specific parameters $p_{1:N}$, each of which captures the probability of occurrence of gene $n$ on a random transcript, and by a common sample size parameter $B$. When multiple cellular conditions are of interest, e.g., stage sequences in a cell cycle, an additional index will denote the specific conditions, e.g., $Y^t$, for measurements obtained at time $t$.
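The sampling scheme behind Eq. 4.1 can be sketched in a few lines. This is a toy simulation under stated assumptions: the function name `simulate_sage_counts`, the gene probabilities in `p`, and the sample size `B` are hypothetical, and each transcript is assumed to carry exactly one gene tag, drawn independently of the others.

```python
import random


def simulate_sage_counts(p, B, rng):
    """Draw B transcripts and return the count vector Y = sum_b X_b.

    Each transcript b yields a one-hot indicator X_b over the N genes,
    where gene n is selected with probability p[n].
    """
    N = len(p)
    Y = [0] * N
    for _ in range(B):
        n = rng.choices(range(N), weights=p, k=1)[0]  # position of the 1 in X_b
        Y[n] += 1
    return Y


rng = random.Random(0)            # fixed seed, for reproducibility only
p = [0.5, 0.3, 0.15, 0.05]        # hypothetical occurrence probabilities
B = 1000                          # hypothetical number of sampled transcripts
Y = simulate_sage_counts(p, B, rng)
print(Y, sum(Y))
```

Under these independence assumptions each component of `Y` is binomial with parameters `B` and `p[n]`, and the components always sum to `B`.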

The main random quantities of interest are: the observed gene expression levels $Y_n^t$, for the $n$-th gene at the $t$-th epoch; the observed gene expression profiles $Y_n^{1:T}$, for the $n$-th gene; and the latent gene expression patterns, e.g., $p_k^{1:T}$ or $\lambda_k^{1:T}$, for the $k$-th theme, as defined in Pritchard et al. (2000) and in the basic model of Section 4.1.2, respectively. Technically, the latent gene expression patterns are multivariate emission probabilities for the gene expression levels, conditionally on the active membership of that gene. The notation I adopt puts forward the set of parameters underlying a specific distribution, e.g., $\lambda_k^{1:T}$ is a vector of Poisson rates, which control the expression levels of those genes that are expressed according to the $k$-th pattern. For example, whenever the $n$-th gene is expressed according to the $k$-th pattern I shall write

$$Y_n^{1:T} \sim \left[\, \mathrm{Pois}(\lambda_k^1), \ldots, \mathrm{Pois}(\lambda_k^T) \,\right].$$

106

CHAPTER 4. COMPLEXITY AND INTEGRATION

E.M. AIROLDI

Analytical Justifications of Contagion. Occurrences of the same gene under single and multiple conditions are not independent of one another, because they are sampled from a cell population or a tissue that provides a specific biological context. Contagion processes provide a useful analytical mechanism to capture this notion. The two proposed generative models for analyzing temporal gene expression profiles $\{Y_n^{1:T}\}_{n=1}^{N}$, that instantiate the contagion process, are based on the Poisson and the negative-binomial distributions of integer counts, at multiple levels. For a review of various parameterizations of the negative-binomial and the corresponding estimators refer to Airoldi et al. (2005c), Johnson et al. (1992) and Kadane et al. (2006).

These choices were motivated by a few main considerations. The Poisson distribution offers a computational advantage over the binomial distribution. It can be safely assumed that the gene-specific probabilities of occurrence $p_{1:N}$ are very small, given that there is a large number of transcripts present in a specific biological sample. Consequently, it is reasonable as well as computationally efficient to approximate the binomial probabilities with Poisson probabilities. The sampling algorithms underlying both the Poisson and negative-binomial distributions lead to marginal and conditional1 distributions for the gene expression levels with desirable properties. Assuming Poisson or negative-binomial conditional emission probabilities relaxes the assumption that, in the (sequential) sampling process described in Section 4.1.1, subsequent observed instances of the same gene tag are independent. In fact, such independence leads to binomial conditional emission probabilities (Pritchard et al., 2000). The dependence among different observations of the same gene tag at the conditional level is one aspect of the notion of contagion. Another aspect of contagion is found at the marginal level. Recall that ideally patterns can be interpreted as biological or functional contexts. Following the intuition that each gene may be expressed under multiple biological contexts to a different degree, the probability of observed gene expression levels, $Y_n^t$, is modeled as a mixture of conditional emission probabilities, where the gene-specific mixture weights given by the mixed membership vectors, $\theta_n$, are constant over time (or across experimental conditions). The mixing leads to marginal distributions that are more skewed than the corresponding conditional distributions, and this is the contagion effect one is most likely to encounter in the literature (e.g., see Simon, 1955). For example, in the case where the conditional probabilities are Poisson, their mixing would increase the variability of the expression levels. A formal model of contagion that encodes this intuition is the negative-binomial model, which arises as an infinite Gamma mixture of Poisson distributions. These arguments support our distributional choices. Furthermore, the marginal distributions that encode contagion fit the observed expression levels well.

1 Conditionally on the active membership.
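The claim that mixing Poisson distributions inflates the variability can be checked by simulation. The sketch below compares a plain Poisson sample with a Gamma mixture of Poissons that has the same mean. It is an illustration only: the Gamma shape and scale, the sample size, and the helper names are hypothetical choices, and the Poisson sampler is Knuth's multiplication method, which is adequate for the moderate rates used here.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method for one Poisson(lam) draw."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def mean_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v


rng = random.Random(1)        # fixed seed, for reproducibility only
shape, scale = 2.0, 2.5       # hypothetical Gamma mixing distribution, mean 5
n = 20000

# Conditional model: every count shares the same rate.
pure = [sample_poisson(shape * scale, rng) for _ in range(n)]
# Marginal model: each count has its own Gamma-distributed rate.
mixed = [sample_poisson(rng.gammavariate(shape, scale), rng) for _ in range(n)]

pure_mean, pure_var = mean_var(pure)
mixed_mean, mixed_var = mean_var(mixed)
print(pure_var / pure_mean, mixed_var / mixed_mean)
```

The variance-to-mean ratio of the plain Poisson sample stays close to one, while the mixed sample is clearly over-dispersed, which is the marginal contagion effect described above.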

To summarize, contagion processes are the result of latent regularities present in structured data, such as the SAGE profiles under study. The fact that genes may be expressed in multiple biological contexts implies a hierarchical mixture of emission probabilities, which ultimately leads to the over-dispersion of gene expression levels. Although this second characteristic of contagion processes is more common in the literature, there is a subtle point to notice in latent aspect models that feature independence of subsequent observed instances of the same gene tag (Pritchard et al., 2000; Minka and Lafferty, 2002; Blei et al., 2003). Specifically, if themes are modeled as multinomial distributions, then Dirichlet distributed mixing weights will not alter the mean-to-variance ratio of the marginal distribution, which is still multinomial. Rather, the main effect of mixing is an increased variability.

Empirical Evidence. The data set that motivates this modeling effort is the set of mouse retinal SAGE libraries analyzed in Cai et al. (2004). The raw mouse retinal data consists of 10 SAGE libraries (38,818 unique genes that appeared more than twice in the sample) from developing retina taken at 2-day intervals, ranging from embryonic day to postnatal day, and adult, for a total of 10 epochs (Blackshaw et al., 2004). Of the 38,818 genes, 1,467 that appeared more than 20 times in at least one of the 10 libraries were selected. These 1,467 genes were regarded as potentially the most biologically relevant because of their high frequency of occurrence. The data analyzed in this paper consists of the pool of observed expression profiles $(Y_n^1, Y_n^2, \ldots, Y_n^{10})$ for the 1,467 selected genes, measured at ten epochs during the development period. Before fitting the models, I tested the distributional assumptions discussed in Section 4.1.1 on the SAGE data at hand.

Table 4.1: Method-of-moments estimates of negative-binomial parameters for gene expression levels in mouse retinal cells at 10 different stages of development (Cai et al., 2004). A discussion of the estimators is given in Airoldi et al. (2005c).

Epoch   mean      var.       $\sqrt{\text{var.}/\text{mean}}$   $\hat{\mu}$          $\hat{\delta}$
1       30.1172   150.8648   2.2381                             11.1733 ± 0.3655     4.3000 ± 0.2155
2       26.5542   163.8892   2.4843                              9.8514 ± 0.4075     6.1021 ± 0.3304
3       28.1718   155.4820   2.3493                             10.4516 ± 0.2936     2.9376 ± 0.1448
4       31.5446   204.2503   2.5446                             11.7029 ± 0.3267     3.2591 ± 0.1588
5       26.0307    94.4013   1.9043                              9.6572 ± 0.4154     6.4720 ± 0.3562
6       26.6489    82.0171   1.7543                              9.8866 ± 0.2118     1.5748 ± 0.0795
7       27.3122    82.0405   1.7331                             10.1327 ± 0.2491     2.1565 ± 0.1066
8       25.1990    53.6102   1.4586                              9.3487 ± 0.2637     2.6407 ± 0.1319
9       27.1513    89.7169   1.8178                             10.0730 ± 0.4472     7.2014 ± 0.4008
10      20.8160    81.2509   1.9757                              7.7226 ± 0.5975    16.8959 ± 1.3156

Table 4.1 reports summary statistics and estimates for the negative-binomial parameters described in Airoldi et al. (2005c). The exploratory data analysis confirms the expected over-dispersion of the gene counts, entailed by the mixture of Poisson distributions assumption. Moreover, the estimates of the extra-Poissonness parameter $\delta$ are all positive2 with very high probability, as indicated by a quick inspection of the corresponding standard deviations. Lastly, I note that the log transformation $\zeta = \log(1 + \delta)$ is effective in reducing the heavy tail of the distribution of $\delta$. Thus, it is preferable to work on the $\zeta$ scale, where a simple prior is sensible.

In conclusion, the SAGE data analyzed here are over-dispersed, i.e., variance > mean. Thus models that treat the random variables $\{X_{1:B}^t\}$ as Bernoulli processes (e.g., Pritchard et al., 2000; Rosenberg et al., 2002) are not appropriate for the SAGE data at hand. Such an assumption leads to clustering models based on Multinomial latent patterns and binomial emission probabilities for feature counts (Blei et al., 2003; Griffiths and Steyvers, 2004; Buntine and Jakulin, 2004; Blei and Lafferty, 2006), which are not warranted in this context.
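The over-dispersion can be read directly off the first two columns of Table 4.1. The short check below recomputes the dispersion ratio for each epoch from the tabulated means and variances; the variable names are hypothetical, and the numbers are copied from the table rather than recomputed from the raw SAGE libraries.

```python
import math

# (mean, variance) of the gene counts at epochs 1..10, copied from Table 4.1.
epoch_stats = [
    (30.1172, 150.8648), (26.5542, 163.8892), (28.1718, 155.4820),
    (31.5446, 204.2503), (26.0307, 94.4013), (26.6489, 82.0171),
    (27.3122, 82.0405), (25.1990, 53.6102), (27.1513, 89.7169),
    (20.8160, 81.2509),
]

# Under a Poisson model the variance equals the mean, so this ratio would be 1.
dispersion = [math.sqrt(var / mean) for mean, var in epoch_stats]
over_dispersed = [var > mean for mean, var in epoch_stats]

for epoch, ratio in enumerate(dispersion, start=1):
    print(epoch, round(ratio, 4))
```

Every epoch has a ratio well above one, and the values agree with the third numeric column of the table to the reported precision.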

2 Recall that as $\delta \to 0$ the negative-binomial density degenerates into a Poisson density.

Figure 4.1: Graphical representation of the generative processes of contagion based on the Poisson (top left) and negative-binomial sampling schemes. The representations for the processes of contagion based on the Poisson sampling scheme for the non-basic models are easily obtained by removing the part of the graphical models depending on $\delta$. In fact, recall that $\delta$ is the extra-Poissonness parameter, and as $\delta \to 0$ the negative-binomial density converges to the corresponding Poisson limit. See Johnson et al. (1992) for more details.

4.1.2  Model  Specifications 

In this section, I fully specify the two hierarchical Bayesian generative processes for allocating SAGE profiles to temporal expression patterns in an unsupervised fashion. These models capture biological context through the notion of contagion. Recall that the observations consist of sequences of counts $(Y_n^1, Y_n^2, \ldots, Y_n^T)$ that measure the abundances of the $n$-th gene in the target cell or tissue across epochs 1 through $T$. The models introduced below need the following two assumptions: (i) a fixed number, $K$, of latent expression profiles exists; (ii) genes are expressed under different profiles to different degrees (mixed membership).

Poisson Generative Process. The first generative process is based on the Dirichlet and Poisson distributions. There are four flavors of the Dirichlet-Poisson generative process: basic (bDiP), normalized (nDiP), conditional (cDiP), and smoothed (sDiP).

The basic model explicitly posits the mixed membership of genes to latent patterns by associating each gene with a Dirichlet vector of probabilities, $\theta_n$. The observed expression profile $Y_n^{1:T}$ of the $n$-th gene, assuming $K$ latent expression profiles, is generated as follows.

1. Sample $\theta_n \sim \mathrm{Dirichlet}_K(\alpha)$

2. For each epoch $t = 1, \ldots, T$

2.1. Sample $z_n^t \sim \mathrm{Multinomial}(\theta_n, 1)$

2.2. Sample $y_n^t \sim \mathrm{Poisson}(\lambda_{tk} \mid z_{nk}^t = 1)$.
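The steps above can be sketched as a forward sampler for a single gene. This is a minimal illustration under stated assumptions, not the inference procedure of the thesis: the helper names, the Dirichlet parameter `alpha`, and the rate matrix `lam` (indexed as `lam[t][k]`) are hypothetical, the Dirichlet draw is obtained by normalizing independent Gamma draws, and the Poisson draw uses Knuth's multiplication method.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Dirichlet draw via normalized independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_poisson(lam, rng):
    """Knuth's multiplication method for one Poisson(lam) draw."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_profile(alpha, lam, rng):
    """Generate (theta_n, z_n, y_n) for one gene under the basic model."""
    K = len(alpha)
    theta = sample_dirichlet(alpha, rng)                   # step 1
    assignments, profile = [], []
    for rates_t in lam:                                    # step 2, one epoch at a time
        k = rng.choices(range(K), weights=theta, k=1)[0]   # step 2.1: active pattern
        assignments.append(k)
        profile.append(sample_poisson(rates_t[k], rng))    # step 2.2: count at epoch t
    return theta, assignments, profile


rng = random.Random(2)              # fixed seed, for reproducibility only
alpha = [1.0, 1.0]                  # hypothetical symmetric Dirichlet parameter, K = 2
lam = [[5.0, 40.0],                 # hypothetical rates lam[t][k] for T = 3 epochs
       [10.0, 30.0],
       [20.0, 20.0]]
theta, assignments, profile = sample_profile(alpha, lam, rng)
print(theta, assignments, profile)
```

Note that the pattern indicator is redrawn at every epoch while `theta` is drawn once per gene, which is what makes the membership mixed rather than unique.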

The genes are the sampling units in SAGE experiments, and the total volume of their expression
often varies over time. Recovering calibrated expression profiles that do not depend on the
total expression volume is desirable. To this end, I posit the normalized model, which rescales
the samples (i.e., the genes) according to their different sizes (the total expression volumes), and
ultimately improves the estimates. In the basic model, the matrix $\Lambda = \{\lambda_{tk}\}$
contains the rates that govern the expression level of genes at T different epochs for each of the
K different latent profiles. In the normalized model, the expected expression level of the n-th
gene at time t for profile k is written as follows,

$\lambda_{tk} = \omega_n \cdot \mu_{tk},$  (4.2)

where $\omega_n$ is scalar and observed, and denotes the total expression level of the n-th gene as
a multiple of a fixed total expression level $\beta$ used as a reference expression level. This new
parameter $\beta$ may be a fixed pre-determined value, estimated via, e.g., empirical Bayes (Carlin
and Louis, 2005), or given a distribution as part of a full Bayesian analysis (Airoldi et al., 2006a).

Note 1. In both the basic and the normalized models above, the rows of the parameter matrices
$\Lambda$ and $\mu$ control the rates at which genes are expressed. In particular, $\lambda_{tk}$
and $\mu_{tk}$ encode the expected expression level of genes at time t for profile k. Since profiles
are by definition not observable, none of these parameters can be estimated directly from the data.

Rows of the normalized rate matrix $\mu$ are reparameterized with the sum/ratio parameterization,
i.e., for every epoch t the following transformation is applied

$(\mu_{t1}, \mu_{t2}, \ldots, \mu_{tK}) \longrightarrow (\sigma_t, \rho_{t1}, \rho_{t2}, \ldots, \rho_{tK}),$  (4.3)

where the sum parameter $\sigma_t := \sum_{k=1}^K \mu_{tk}$, the ratio parameters
$\rho_{tk} := \mu_{tk} / \sigma_t$, and the constraint that $\sum_{k=1}^K \rho_{tk} = 1$ makes the
ratio parameter $\rho_{tK}$ redundant for each t. This reparameterization leads to the conditional
model, where the sum parameters $(\sigma_1, \sigma_2, \ldots, \sigma_T)$ are directly estimable from
the data, and inference can be carried out conditionally on them. This is possible since the
parameters $\sigma_{1:T}$ encode the total normalized expression levels at time t (sum of the
expression levels over the K latent patterns), which is an observable quantity as it does not depend
on the latent profiles. Conditioning on the MLEs for the total expression parameters, $\sigma_t$,
leads to a new allocation problem where the differential expression levels of genes under the K
profiles need to be inferred. In other words, the total expression level at each time t needs to be
allocated among the latent patterns, given a constraint on their sum and a direct estimate of
$\sigma_t$.
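
The sum/ratio parameterization of Equation 4.3 is a one-to-one map, which the following short Python sketch illustrates. The function names and the numerical rate matrix are hypothetical and serve only as an illustration.

```python
def to_sum_ratio(mu_row):
    """Map (mu_t1, ..., mu_tK) to (sigma_t, [rho_t1, ..., rho_tK])."""
    sigma = sum(mu_row)
    return sigma, [m / sigma for m in mu_row]


def from_sum_ratio(sigma, rho_row):
    """Inverse map: mu_tk = sigma_t * rho_tk."""
    return [sigma * r for r in rho_row]


mu = [[2.0, 6.0, 12.0], [1.0, 1.0, 2.0]]   # hypothetical, T = 2 and K = 3
reparam = [to_sum_ratio(row) for row in mu]
recovered = [from_sum_ratio(sigma, rho) for sigma, rho in reparam]
```

For each epoch the ratios sum to one, so only K - 1 of them are free, and multiplying them by the sum parameter recovers the original rates.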

Lastly, I introduce the smoothed model, which posits a prior for the differential expression
rate parameters to smooth their estimates. In the smoothed model I assume that the differential
expression levels are sampled

$\rho_{t\cdot} \sim \mathrm{Dirichlet}_K(\beta)$

for each epoch t = 1, 2, ..., T. See Figure 4.1. In principle, it is possible to posit a prior
distribution on the total expression rate parameters as well. A brief analysis of the observed total
rates suggests that it is appropriate to apply a logarithmic transformation to them to stabilize the
variability, and one can introduce a Gaussian prior on the transformed rates; however, an inspection
of the total rates $\sigma_t$ over time (see Table 4.1) suggests that some other phenomenon is
possibly going on, which leads to a decreasing occurrence of the genes in the SAGE libraries.
Therefore the observed total rates are used to inform our inferences directly, as in the conditional
model. Smoothing the overall rates $\{\sigma_t\}$ would impose a model on data that cannot be
justified, since it is not clear why the overall rates are declining. This would cast some doubt on
the interpretability of the inferences such a model would lead to.

Summarizing, the Dirichlet-Poisson generative process possesses a few advantages: (i) the sampling
scheme encodes contagion in the sense that multiple occurrences of the same gene tag at the
same epoch depend on one another, given their active memberships to a specific latent expression
pattern; (ii) the sampling scheme arises naturally in SAGE biological experiments discussed
in Section 4.1.1; (iii) computing Poisson probabilities is more efficient than computing binomial
probabilities, since binomial coefficients need not be evaluated.


Negative-Binomial Generative Process The generative process of contagion based on the
negative-binomial sampling scheme is similar in spirit to the previous one based on the Poisson
sampling scheme. A formal treatment of the models is given in Airoldi et al. (2005c). Intuitively, the
negative-binomial distribution has two parameters that control mean and variance; furthermore,
its variance is always greater than its mean, a useful property that replicates the observed
over-dispersion of gene expression levels. The negative-binomial density can be written as a Poisson
density with an extra parameter $\delta$ that controls the amount of extra-Poisson variability. Thus,


$\mathrm{NB}(y_n^t \mid \omega_n \mu_t, \omega_n \delta_t) = \frac{\Gamma(y_n^t + \kappa_t)}{y_n^t! \, \Gamma(\kappa_t)} \, \frac{(\omega_n \delta_t)^{y_n^t}}{(1 + \omega_n \delta_t)^{y_n^t + \kappa_t}},$

where $\kappa_t := \mu_t / \delta_t$ for convenience of notation. In the normalized model,
$\{\mu_{tk}\}$ are the profile-specific Poisson rates and $\{\delta_{tk}\}$ are profile-specific
extra-Poissonness parameters. The conditional model then follows from the application of the
sum/ratio parameterization (see Equation 4.3) to both sets of parameters

$(\mu_{t1}, \mu_{t2}, \ldots, \mu_{tK}) \longrightarrow (\sigma_t, \rho_{t1}, \rho_{t2}, \ldots, \rho_{tK})$

$(\delta_{t1}, \delta_{t2}, \ldots, \delta_{tK}) \longrightarrow (\xi_t, \eta_{t1}, \eta_{t2}, \ldots, \eta_{tK}).$

Lastly, the smoothed model imposes probabilistic constraints on both the differential expression
levels and the differential extra-Poissonness parameters by assuming that they are independent
samples from two Dirichlet distributions with distinct sets of underlying constants,

$\rho_{t\cdot} \sim \mathrm{Dirichlet}_K(\beta)$ and $\eta_{t\cdot} \sim \mathrm{Dirichlet}_K(\gamma),$

for each epoch t = 1, 2, ..., T. See Figure 4.1.
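
As a numerical illustration of the over-dispersion property, the sketch below evaluates a negative-binomial density written as a Poisson density with an extra parameter. It assumes the parameterization with kappa = mean / extra, under which the mean equals `mean` and the variance equals `mean * (1 + extra)`; the function names and the numerical values are hypothetical.

```python
import math


def nb_log_pmf(y, mean, extra):
    """log NB(y | mean, extra), with kappa = mean / extra (assumed form).

    Here `mean` stands for omega_n * mu_t and `extra` for omega_n * delta_t.
    """
    kappa = mean / extra
    return (math.lgamma(y + kappa) - math.lgamma(y + 1) - math.lgamma(kappa)
            + y * math.log(extra) - (y + kappa) * math.log1p(extra))


def moments(mean, extra, upper=500):
    """Sum the pmf numerically to recover total mass, mean and variance."""
    probs = [math.exp(nb_log_pmf(y, mean, extra)) for y in range(upper)]
    total = sum(probs)
    first = sum(y * p for y, p in enumerate(probs))
    second = sum(y * y * p for y, p in enumerate(probs))
    return total, first, second - first * first


total, nb_mean, nb_var = moments(mean=4.0, extra=0.5)
```

With these values the variance exceeds the mean by the factor 1 + extra, which is the extra-Poisson variability that the additional parameter controls.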

4.1.3  Estimation  and  Inference 

In  order  to  obtain  the  posterior  for  the  latent  variables, 

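$P(\{\theta_n, z_n^{1:T}\}_{n=1}^N \mid \{y_n^{1:T}\}_{n=1}^N, \alpha, \{\lambda_k^{1:T}\}_{k=1}^K) = \frac{P(\{\theta_n, z_n^{1:T}\}_{n=1}^N, \{y_n^{1:T}\}_{n=1}^N \mid \alpha, \{\lambda_k^{1:T}\}_{k=1}^K)}{P(\{y_n^{1:T}\}_{n=1}^N \mid \alpha, \{\lambda_k^{1:T}\}_{k=1}^K)},$  (4.4)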

one needs to evaluate the likelihood in the denominator of Expression 4.4, which is given by an
integral with no closed-form solution. Thus I develop a mean-field approximation to the posterior,
which involves the substitution of an integrable lower bound for the likelihood. The mean-field
approximation involves positing a simple distribution, q, over the latent variables, which depends
upon an extra set of (variational) free parameters, $\{\nu_n, \phi_n^{1:T}\}_{n=1}^N$ in this case.
The free parameters are then set to minimize the Kullback-Leibler divergence between the true and
approximate posteriors. This is equivalent to maximizing a lower bound for the likelihood over the
free parameters within each E-step; pseudo-expectations for the latent variables are then computed
using the maximized lower bound. The overall inference algorithm is a variational EM scheme. At each
iteration, the EM algorithm employs the mean-field approximation to carry out the E-step (just
discussed) and then employs a regular M-step, where the maximum likelihood estimates of the model
parameters, e.g., $(\alpha, \{\lambda_k^{1:T}\}_{k=1}^K)$ for the basic model, are updated by
maximizing the lower bound for the likelihood over such parameters. These two steps are iterated
until convergence of the lower bound for the likelihood.

The variational EM scheme just described practically translates into a coordinate ascent algorithm,
where parameters are naturally organized into batches with similar semantics. The parameter
updates corresponding to the model variants considered above are summarized in Table 4.2.

A General Bayesian Formalism for Latent Aspects Analysis The variational inference scheme
developed for the two models of counts is actually quite general. In fact, the free parameter updates
(that are used to maximize the lower bound for the likelihood within each E-step) take a generic
form applicable to all the conditional emission probability functions considered above, e.g.,
Table 4.2. Furthermore, for generic conditional emission probabilities $p(y_n^t \mid \beta_{tk})$
for all (n, t, k), with parameter set $\{\beta_k^{1:T}\}_{k=1}^K$, the following general free
parameter updates can be used

$\phi^*_{ntk} \propto \Upsilon \cdot p(y_n^t \mid \beta_{tk}),$

where $\Upsilon := e^{E_q[\log \theta_{nk}]}$, as in Table 4.2. The updates for
$\nu^*_{nk} = \alpha_k + \sum_t \phi_{ntk}$ remain unchanged.

The generality of the approximate E-step in latent aspects analyses that feature one latent group
indicator, $z_n^t$, for each gene-epoch pair (n, t) is due to the specific hierarchical formulation
of our models. Such a formulation posits exchangeable measurements on features, e.g., gene expression
levels at each epoch. Different conditional emission probabilities only lead to different estimators
for the corresponding parameters, $\{\beta_k^{1:T}\}_{k=1}^K$, in the M-step.
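
The generic E-step and a Poisson M-step can be sketched as a small coordinate ascent loop. The code below is an illustrative sketch under several assumptions, not the thesis implementation: the function names, the toy counts, the starting rates and the fixed number of sweeps are hypothetical, the Dirichlet hyper-parameter is held fixed rather than updated with Newton-Raphson, and the digamma function is approximated by a standard recurrence plus asymptotic series because the Python standard library does not provide it.

```python
import math


def digamma(x):
    """Digamma via upward recurrence plus an asymptotic expansion."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))


def poisson_log_pmf(y, rate):
    return y * math.log(rate) - rate - math.lgamma(y + 1)


def e_step(counts, alpha, params, log_emission, sweeps=20):
    """Mean-field updates for one gene; returns (nu, phi[t][k])."""
    T, K = len(counts), len(alpha)
    nu = [a + T / K for a in alpha]
    phi = [[1.0 / K] * K for _ in range(T)]
    for _ in range(sweeps):
        # E_q[log theta_k] under a Dirichlet(nu) variational factor.
        e_log_theta = [digamma(v) - digamma(sum(nu)) for v in nu]
        for t in range(T):
            logs = [e_log_theta[k] + log_emission(counts[t], params[t][k])
                    for k in range(K)]
            shift = max(logs)                  # guard against underflow
            weights = [math.exp(v - shift) for v in logs]
            norm = sum(weights)
            phi[t] = [w / norm for w in weights]
        nu = [alpha[k] + sum(phi[t][k] for t in range(T)) for k in range(K)]
    return nu, phi


def m_step_poisson(all_counts, all_phi):
    """Poisson rate update: phi-weighted mean count for each (t, k)."""
    T, K = len(all_phi[0]), len(all_phi[0][0])
    return [[sum(p[t][k] * y[t] for y, p in zip(all_counts, all_phi))
             / sum(p[t][k] for p in all_phi)
             for k in range(K)] for t in range(T)]


counts = [[1, 0, 2], [30, 25, 28], [2, 1, 0], [27, 31, 29]]  # toy data
alpha = [0.1, 0.1]                     # hypothetical, held fixed
rates = [[1.0, 20.0]] * 3              # hypothetical starting rates, T = 3
for _ in range(10):
    fits = [e_step(y, alpha, rates, poisson_log_pmf) for y in counts]
    rates = m_step_poisson(counts, [phi for _, phi in fits])
```

Passing a different `log_emission` function changes only the likelihood term of the E-step, whereas the M-step estimator has to be replaced to match the chosen emission distribution.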
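Table 4.2: The table quotes the parsimonious mean-field approximation for the various models.
The parsimonious mean-field approximation posits one latent expression profile indicator
z for each (gene, epoch) pair. Note that $\Upsilon := e^{E_q[\log \theta_{nk}]}$, and Po, NB are
short for Poisson and Negative-Binomial, respectively. ** Alternatively use the Method of Moments
described in Airoldi et al. (2005c), pretending to observe pseudo counts
$\{\phi_{ntk} \cdot y_n^t\}$ as the expression levels of the n-th gene according to the k-th
latent theme.

Basic
  Poisson:
    $\nu^*_{nk} = \alpha_k + \sum_t \phi_{ntk}$
    $\phi^*_{ntk} \propto \Upsilon \cdot \mathrm{Po}(y_n^t \mid \lambda_{tk})$
    $\lambda^*_{tk} = \frac{\sum_n \phi_{ntk} y_n^t}{\sum_n \phi_{ntk}}$
    $\alpha^*_k$ with Newton-Raphson

Norm.
  Poisson:
    $\nu^*_{nk} = \alpha_k + \sum_t \phi_{ntk}$
    $\phi^*_{ntk} \propto \Upsilon \cdot \mathrm{Po}(y_n^t \mid \omega_n \mu_{tk})$
    $\mu^*_{tk} = \frac{\sum_n \phi_{ntk} y_n^t}{\sum_n \phi_{ntk} \omega_n}$
    $\alpha^*_k$ with Newton-Raphson
  Negative-Binomial:
    $\nu^*_{nk} = \alpha_k + \sum_t \phi_{ntk}$
    $\phi^*_{ntk} \propto \Upsilon \cdot \mathrm{NB}(y_n^t \mid \omega_n \mu_{tk})$
    $\mu^*_{tk} = \frac{\sum_n \phi_{ntk} y_n^t}{\sum_n \phi_{ntk} \omega_n}$
    $\delta^*_{tk}$ = L-BFGS **
    $\alpha^*_k$ with Newton-Raphson

Cond.
  Poisson:
    $\nu^*_{nk} = \alpha_k + \sum_t \phi_{ntk}$
    $\phi^*_{ntk} \propto \Upsilon \cdot \mathrm{Po}(y_n^t \mid \omega_n \sigma_t \rho_{tk})$
    $\rho^*_{tk} = \frac{\sum_n \phi_{ntk} y_n^t}{\sum_n \phi_{ntk} \omega_n \sigma_t}$
    $\alpha^*_k$ with Newton-Raphson
  Negative-Binomial:
    $\nu^*_{nk} = \alpha_k + \sum_t \phi_{ntk}$
    $\phi^*_{ntk} \propto \Upsilon \cdot \mathrm{NB}(y_n^t \mid \omega_n \sigma_t \rho_{tk})$
    $\rho^*_{tk} = \frac{\sum_n \phi_{ntk} y_n^t}{\sum_n \phi_{ntk} \omega_n \sigma_t}$
    $\eta^*_{tk}$ = L-BFGS **
    $\alpha^*_k$ with Newton-Raphson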

Related Work There is a simple connection between the algorithms developed here and the
PoissonC and PoissonL algorithms introduced by Cai et al. (2004). In the problem at hand the
goal is to allocate the observed temporal expression profiles $\{Y_n^{1:T}\}_{n=1}^N$ into, say, K
patterns or clusters. Recall that the K-means unsupervised clustering algorithm searches for K means
$m_{1:K}$ that minimize

$\mathrm{MSE} = \frac{1}{N} \sum_{k=1}^K \sum_{n=1}^N 1(Y_n^{1:T} \in k) \, \| Y_n^{1:T} - m_k \|^2.$

That is, the means $m_{1:K}$ are the centers of K clusters in the sense of Euclidean norm. The
PoissonC and PoissonL algorithms introduced by Cai et al. (2004) substitute the Euclidean norm in the
equation with the chi-squared score and the negative log-likelihood,

$\mathcal{L}(n,k) = -\log \prod_{t=1}^T \frac{e^{-\lambda_{tk} \omega_n} (\lambda_{tk} \omega_n)^{y_n^t}}{y_n^t!},$

respectively. The normalized model based on the Poisson distribution is an extension of the PoissonL
algorithm, in which Dirichlet-distributed mixed-membership vectors $\theta_n$, not known in advance,
are introduced. In the PoissonL algorithm the mixed-membership vectors $\theta_n$ are known, i.e.,
for the n-th gene it follows that

$\theta_{nk} = \begin{cases} 1 & \text{if } k = j_n \\ 0 & \text{otherwise,} \end{cases}$

where $j_n = \arg\min \{ \mathcal{L}(n,k) : k \in [1, K] \}$. This extension is similar in spirit to
that introduced by Gaussian mixtures to regular K-means (Blei and Fienberg, 2007). In fact,

$\theta_{nk} = \Pr(\text{cluster} = k \mid \text{data}, \text{parameters}).$
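
The contrast between the hard assignment of PoissonL and a soft, mixture-style assignment can be sketched as follows. This is a hedged illustration: the function names, the profile rates, the gene counts and the scale are hypothetical, and the soft assignment assumes equal prior weights on the clusters, which is a simplification and not the Dirichlet mixed-membership estimate developed in this section.

```python
import math


def neg_log_lik(counts, profile_rates, scale):
    """Negative Poisson log-likelihood of one gene under one profile."""
    total = 0.0
    for y, lam in zip(counts, profile_rates):
        rate = lam * scale
        total -= y * math.log(rate) - rate - math.lgamma(y + 1)
    return total


def hard_membership(counts, profiles, scale):
    """One-hot vector: 1 at the profile with minimal loss, 0 elsewhere."""
    losses = [neg_log_lik(counts, rates, scale) for rates in profiles]
    j_n = losses.index(min(losses))
    return [1.0 if k == j_n else 0.0 for k in range(len(profiles))]


def soft_membership(counts, profiles, scale):
    """Posterior cluster probabilities, assuming equal prior weights."""
    losses = [neg_log_lik(counts, rates, scale) for rates in profiles]
    best = min(losses)
    weights = [math.exp(best - loss) for loss in losses]
    norm = sum(weights)
    return [w / norm for w in weights]


profiles = [[1.0, 2.0, 4.0], [4.0, 2.0, 1.0]]  # hypothetical rates, K = 2
counts, scale = [2, 4, 9], 2.0                 # hypothetical gene and scale
hard = hard_membership(counts, profiles, scale)
soft = soft_membership(counts, profiles, scale)
```

The hard vector places all mass on the best-fitting profile, while the soft vector keeps a small positive weight on the alternative.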


Note that introducing latent Dirichlet-distributed mixed-membership vectors, $\theta_n$, ties
together all the data in the inference task. This has the beneficial effect of reducing the
variability of pattern-specific parameters since all the gene counts are used (independently of which
pattern they express the most) in estimating each such parameter. Such an improvement in the estimates
is expected (James and Stein, 1961). Our basic Poisson model is similar to that of Canny (2004). For
a technical survey of related latent aspects models in the context of text data analysis see Buntine
and Jakulin (2006).


Example 4.1 (Continued) Contagion induces a non-trivial difference in the generative process
with respect to the independence model (Pritchard et al., 2000; Minka and Lafferty, 2002;
Blei et al., 2003) that has far-reaching implications for the analysis of data. For example, models
of contagion provide a better fit for data in biological applications such as SAGE by providing a
realistic mean-to-variance marginal ratio. A better fit helps recover more precise mixed memberships
of genes to patterns, as well as find cleaner temporal expression patterns when compared
to those found by independence models. This general issue is explored further in Section
6.2.1.


Figure 4.2: Gene expression themes learned from mouse retinal SAGE using conditional DiP.

Recall that the mouse retinal SAGE libraries analyzed in Cai et al. (2004) contain 38,818 unique
genes for a total of 10 epochs. At first, I perform model selection by means of a five-fold
cross-validation scheme, to estimate the plausible number of latent themes that best explain the
data. The held-out likelihood peaked at 15 themes for cDiP, and 10 for the independence model.
Figure 4.2 shows the 15 gene expression patterns inferred using the conditional Poisson model (cDiP).
The variance of each pattern is not shown: the variances are so small that the variance-bars are
masked by the dot symbol in our plots. Notably, the magnitude of the held-out likelihood for cDiP is
about ten times larger (on the log scale) than that for the independence model, suggesting a better
overall fit of cDiP to the data. Furthermore, the corresponding mixed-membership estimates
$\{\theta_n\}$ are more sharply peaked; this result holds in general for over-dispersed data sets.
See Figure 6.2 for an example. The result is indirectly supported by the estimates of the Dirichlet
hyper-parameter as well, $\alpha_{indep} = 1.355$ versus $\alpha_{DiP} = 0.066$. The patterns (or
clusters) shown in Figure 4.2 indeed lead to reasonable predictions of mouse retinal gene functions.
For example, a preliminary biological validation of the patterns inferred using cDiP based on the GO
annotation shows correlation between the latent patterns and gene functions such as photoreceptors
and rhodopsin, i.e., genes with similar functional annotations tend to fall into the same pattern. An
in-depth analysis of the biological significance of the inferred patterns is given elsewhere
(Airoldi et al., 2006f).


Modeling Choices and Inference In problems where attributes co-occur frequently (e.g., a pair
of genes can be present on many transcripts), the computational gains sought after by positing
models that rely on unrealistic assumptions are seldom achieved. Applications to problems that
arise in computational biology, e.g., SAGE and microarray data, are one such case. Probabilistic
models that replicate salient features of the data typically lead to better inferences on latent
quantities of interest, e.g., the latent temporal patterns of Example 20. In the models introduced in
this section, the salient features of interest are the marginal variability and the notion of
contagion. The inference suggests that the inferred latent patterns can be interpreted as temporal
expression patterns that are typical of fairly distinct functional biological contexts, which is the
desired outcome. This contrasts with the poorly interpretable results obtained with the independence
model, and makes a good case for modeling choices that “let the data tell their story.” Following
these thoughts, model variants tailored to different properties of biological data have been
introduced, and the general scheme for posterior inference has been derived.

Concluding, the estimates that the proposed models provide are sharper than those entailed by
existing methods based on stronger independence assumptions, in the context of the SAGE analysis.
This demonstrates the feasibility of a promising hierarchical Bayesian formalism for soft clustering
and latent aspect analysis.


4.2  Multivariate  Model  Specifications 

Here  I  present  a  multivariate  generalization  of  the  hierarchical  models  of  mixed  membership  for 
attributes  and  relations. 

4.2.1  Attributes:  Hierarchical  Bayesian  Models  of  Mixed  Membership 

There  are  a  number  of  earlier  instances  of  mixed-membership  models  that  have  appeared  in  the 
scientific  literature  (e.g.,  see  Erosheva  and  Fienberg,  2005).  A  general  formulation  characterizes 
the  models  of  mixed-membership  in  terms  of  assumptions  at  four  levels  (Erosheva  et  al.,  2004). 

Assumption 1 (Population Level). There are K classes or sub-populations in the population of
interest and J distinct characteristics. Denote by $f(x_{nj} \mid \beta_{jk})$ the probability
distribution of the j-th response variable in the k-th sub-population for the n-th subject, where
$\beta_{jk}$ is a vector of relevant parameters, $n \in [1, N]$, $j \in [1, J]$, and
$k \in [1, K]$. Within a sub-population, the observed responses are assumed to be independent
across subjects and characteristics.

Figure 4.3: Left: A graphical representation of hierarchical Bayesian models of mixed
membership. Right: Models of text and references used in this paper. Specifically, replicates
of variables $\{x_1^r, x_2^r\}$ are paired with latent variables $\{z_1^r, z_2^r\}$ that indicate
which latent aspect informs the parameters underlying each individual replicate. The parametric and
non-parametric versions of the error models for the label discussed in the text refer to the
specification of $D_\alpha$: a Dirichlet distribution versus a Dirichlet process, respectively.

Assumption 2 (Subject Level). The components of the mixed-membership vector
$\theta_n = (\theta_{n[1]}, \ldots, \theta_{n[K]})'$ represent the membership of the n-th subject to
the various sub-populations.3 The distribution of the observed response $x_{nj}$ given the individual
membership scores $\theta_n$ is then

$\Pr(x_{nj} \mid \theta_n) = \sum_{k=1}^K \theta_{n[k]} \, f(x_{nj} \mid \beta_{jk}).$  (4.5)

Conditional on the mixed-membership scores, the response variables $x_{nj}$ are independent of one
another, and independent across subjects.


Assumption 3 (Latent Variable Level). The mixed-membership vectors, $\theta_{1:N}$, are independent
realizations of a latent random quantity with distribution $D_\alpha$, parameterized by a vector of
underlying constants $\alpha$. The probability of observing $x_{nj}$, given the parameters, is then

$\Pr(x_{nj} \mid \alpha, \beta) = \int \Big( \sum_{k=1}^K \theta_{n[k]} \, f(x_{nj} \mid \beta_{jk}) \Big) D_\alpha(d\theta).$  (4.6)

3 I denote components of a vector $v_n$ with $v_{n[i]}$, and the entries of a matrix $m_n$ with $m_{n[ij]}$.

Assumption 4 (Sampling Scheme Level). The R independent replications of the J distinct response
variables corresponding to the n-th subject are independent of one another. The probability
of observing $\{x^r_{n1}, \ldots, x^r_{nJ}\}_{r=1}^R$, given the parameters, is then

$\Pr(\{x^r_{n1}, \ldots, x^r_{nJ}\}_{r=1}^R \mid \alpha, \beta) = \int \Big( \prod_{j=1}^J \prod_{r=1}^R \sum_{k=1}^K \theta_{n[k]} \, f(x^r_{nj} \mid \beta_{jk}) \Big) D_\alpha(d\theta).$  (4.7)

The number of observed response variables is not necessarily the same across subjects, i.e.,
$J = J_n$. Likewise, the number of replications is not necessarily the same across subjects and
response variables, i.e., $R = R_{nj}$.

Example 22 (Latent Dirichlet Allocation). The general formulation encompasses popular data
mining models such as the latent Dirichlet allocation model (LDA) for use in the analysis of
scientific publications (Minka and Lafferty, 2002; Blei et al., 2003). Consider a collection of
documents; sub-populations correspond to latent topics, indexed by k; subjects correspond to
“documents,” indexed by n; J = 1, i.e., there is only one response variable that encodes which “word”
in the vocabulary is chosen to fill a position in a text of known length, so that j is omitted;
positions in the text correspond to replicates, and we have a different number of them for each
document, i.e., we observe $R_n$ positions filled with words in the n-th document. The model assumes
that each position in a document is filled with a word that expresses a specific topic, so that
distinct instances of the same word may be expressions of different topics. In order to do so, an
explicit indicator variable $z^r_n$ is introduced for each observed position in each document, which
indicates the topic that expresses the word in such position. The function
$f(x^r_n \mid \beta_k)$ is given by the probability $\Pr(x^r_n = 1 \mid z^r_n = k)$, which is
specified as Multinomial($\beta_k$, 1), where $\beta_k$ is a random vector the size of the
vocabulary, say V, and $\sum_{v=1}^V \beta_{k[v]} = 1$. A mixed-membership vector $\theta_n$ is
associated with the n-th document, which encodes the topic proportions that ultimately inform the
choice of words in that document, and it is distributed according to a Dirichlet distribution, which
specifies $D_\alpha$. Equation 4.8 is obtained by integrating out the topic indicator variable
$z^r_n$ at the word level; the latent indicators $z^r_n$ are distributed according to a
Multinomial($\theta_n$, 1).
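
Integrating out the topic indicator at the word level leaves a mixture over topics for each position, which the following Python sketch makes concrete. The function names, the two topics over a three-word vocabulary, the topic proportions and the toy document are all hypothetical.

```python
import math


def word_probabilities(theta, topics):
    """Pr(word = v | theta_n) = sum_k theta_n[k] * beta_k[v]."""
    vocab_size = len(topics[0])
    return [sum(t * beta[v] for t, beta in zip(theta, topics))
            for v in range(vocab_size)]


def document_log_likelihood(word_ids, theta, topics):
    """Sum of log marginal word probabilities over the R_n positions."""
    probs = word_probabilities(theta, topics)
    return sum(math.log(probs[v]) for v in word_ids)


topics = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]   # hypothetical, K = 2 and V = 3
theta = [0.25, 0.75]                          # hypothetical topic proportions
probs = word_probabilities(theta, topics)
loglik = document_log_likelihood([0, 2, 2, 1], theta, topics)
```

The log-likelihood is conditional on the topic proportions; the marginal over the Dirichlet prior has no closed form and is not attempted here.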

Example 23 (Grade of Membership). The Grade of Membership model (GoM) is another specific
model that can be cast in terms of mixed membership. This model was first introduced by Woodbury
in the 1970s in the context of medical diagnosis (Woodbury et al., 1978) and was developed
further and elaborated upon in a series of papers and in Manton et al. (1994). Erosheva (2002)
reformulated the GoM model according to the specifications of Section 4.2.1. Consider disability
survey data collected for the National Long Term Care Survey; there are no replications, i.e.,
$R_n = 1$, but several attributes of each American senior are recorded, i.e., J = 16 daily
activities. Furthermore, the scalar parameter $\beta_{jk}$ is the probability of being disabled on
the activity j for a member of latent pattern k, that is,
$\beta_{jk} = P(x_j = 1 \mid \theta_k = 1)$. Dealing with binary data (individuals are either
disabled or healthy), the probability distribution $f(x_j \mid \beta_{jk})$ is specified by
a Bernoulli distribution with parameter $\beta_{jk}$. Therefore, a member n of latent profile k is
disabled on the activity j, i.e., $x_{nj} = 1$, with probability $\beta_{jk}$. In other words,
introducing a profile indicator variable $z_{nj}$, we have
$P(x_{nj} = 1 \mid z_{nj} = k) = \beta_{jk}$. Each individual n is characterized by a vector
of membership scores $\theta_n = (\theta_{n1}, \ldots, \theta_{nK})$. In this model the membership
scores $\theta_n$ follow the distribution $D_\alpha$ (for example, a Dirichlet distribution with
parameter $\alpha = (\alpha_1, \ldots, \alpha_k, \ldots, \alpha_K)$). Note that the ratio
$\alpha_k / \sum_k \alpha_k$ represents the proportion of the population that “belongs” to the
k-th latent pattern.
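
A small numerical sketch of the Grade of Membership quantities described above follows. The function names, the disability probabilities for two activities and two latent patterns, and the Dirichlet parameter are hypothetical illustrations, not estimates from the survey data.

```python
def response_probability(theta, beta_j):
    """Pr(x_nj = 1 | theta_n) = sum_k theta_nk * beta_jk."""
    return sum(t * b for t, b in zip(theta, beta_j))


def individual_likelihood(x, theta, beta):
    """Product over the J binary responses, conditional on theta_n."""
    likelihood = 1.0
    for x_j, beta_j in zip(x, beta):
        p = response_probability(theta, beta_j)
        likelihood *= p if x_j == 1 else 1.0 - p
    return likelihood


def population_proportions(alpha):
    """Dirichlet mean: alpha_k divided by the sum of all alpha_k."""
    total = sum(alpha)
    return [a / total for a in alpha]


beta = [[0.9, 0.1], [0.8, 0.2]]   # hypothetical beta[j][k], J = 2 and K = 2
theta = [0.5, 0.5]                # hypothetical membership scores
p_first = response_probability(theta, beta[0])
lik = individual_likelihood([1, 0], theta, beta)
shares = population_proportions([1.0, 3.0])
```

An individual split evenly between a mostly disabled and a mostly healthy pattern has an intermediate probability of disability on each activity.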

Note 2 (Related Work). It is possible to situate this formulation in a familiar landscape by
discussing similarities with other unsupervised data mining methods. Recall that the problem is to
group observations about N subjects $\{x_n^{1:R_n}\}_{n=1}^N$ into, say, K groups. K-means
clustering, for example, searches for K centroids $m_{1:K}$ that minimize

$\mathrm{MSE} = \frac{1}{N} \sum_{k=1}^K \sum_{n=1}^N 1(x_n^{1:R_n} \in k) \, \| x_n^{1:R_n} - m_k \|^2,$

where the centroids $m_{1:K}$ are centers of the respective clusters in the sense of Euclidean norm.
Subjects have single group membership in K-means. In the mixture of Gaussians model, a popular model
that extends K-means, the MSE scoring criterion is substituted by the likelihood
$\sum_{n,k} \mathcal{L}(n,k)$. The unknown mixed-membership vectors $\theta_n$ relax the single
membership implicit in K-means. The connection is given by the fact that the mixed-membership vectors
$\theta_n$, i.e., the class abundances, have a specific form in K-means, i.e., for the n-th subject
it follows that

$\theta_{n[k]} = \begin{cases} 1 & \text{if } k = j_n \\ 0 & \text{otherwise,} \end{cases}$

where $j_n = \arg\min \{ \mathcal{L}(n,k) : k \in [1, K] \}$. In general, the unknown
mixed-membership vectors $\theta_n$ are independent samples from $D_\alpha$. Furthermore, in the
general formulation of Section 4.2.1 it is possible to have more complicated likelihood structures.

4.2.2  Relations:  Stochastic  Block  Models  of  Mixed  Membership 

The class of stochastic block models of mixed membership is a rich class of models that is
instrumental for thinking about the scientific problems outlined in Section 3.1 and amenable to
theoretical analysis. A general formulation characterizes stochastic block models of mixed membership
in terms of assumptions at four levels, as follows (Airoldi et al., 2006d).

Assumption 5 (Population Level). There are K classes or sub-populations in the population of
interest. Denote by $f(y_{jnm} \mid \eta_{gh})$ the probability distribution of the j-th response
graph at the pair of nodes (n, m), where the n-th node is in the g-th sub-population, the m-th node
is in the h-th sub-population, and $\eta_{gh}$ contains the relevant parameters. The indices n, m
run in $\mathcal{N}$, and the indices g, h run in [1, K]. Within sub-population pairs, the observed
paired responses are assumed independent.

Assumption 6 (Node Level). The components of the membership vector
$\theta_n = (\theta_{n1}, \ldots, \theta_{nK})'$ encode the mixed membership of the n-th node to the
various sub-populations. The distribution of the observed response $y_{jnm}$ given the relevant,
node-specific membership scores, $(\theta_n, \theta_m)$, is then

$\Pr(y_{jnm} \mid \theta_n, \theta_m, \eta) = \sum_{g,h=1}^K \theta_{ng} \, f(y_{jnm} \mid \eta_{gh}) \, \theta_{mh}.$  (4.8)

Conditional on the mixed-membership scores, the response edges $y_{jnm}$ are independent of one
another, both across distinct graphs and pairs of nodes.
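
For binary edges with Bernoulli block probabilities, Equation 4.8 reduces to a bilinear form in the two membership vectors. The sketch below illustrates this special case; the function name, the block matrix and the membership vectors are hypothetical.

```python
def edge_probability(theta_n, theta_m, eta):
    """Pr(y_nm = 1 | theta_n, theta_m): sum over g, h of theta_ng * eta_gh * theta_mh."""
    return sum(theta_n[g] * eta[g][h] * theta_m[h]
               for g in range(len(theta_n))
               for h in range(len(theta_m)))


eta = [[0.9, 0.1], [0.1, 0.8]]    # hypothetical block model, K = 2
pure = edge_probability([1.0, 0.0], [0.0, 1.0], eta)
mixed = edge_probability([0.5, 0.5], [0.5, 0.5], eta)
```

With pure memberships the probability is a single entry of the block model, while mixed memberships average over all pairs of blocks.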

Assumption 7 (Latent Variable Level). The mixed-membership vectors, $\theta_{1:N}$, are independent
realizations of a latent random quantity with distribution $D_\alpha$, parameterized by a vector of
underlying constants $\alpha$. The probability of observing $y_{jnm}$, given the parameters, is then

$\Pr(y_{jnm} \mid \alpha, \eta) = \int \Big( \sum_{g,h=1}^K \theta_{ng} \, f(y_{jnm} \mid \eta_{gh}) \, \theta_{mh} \Big) D_\alpha(d\theta).$  (4.9)


Assumption 8 (Sampling Scheme Level). The R independent replications of the J distinct response
graphs are independent of one another. The probability of observing the whole collection
of graphs, $\{y_{jrnm}\}$, given the parameters, is then given by the following equation.

$\Pr(\{y_{jrnm}\} \mid \alpha, \eta) = \int \Big( \prod_{j=1}^J \prod_{r=1}^R \prod_{n,m=1}^N \sum_{g,h=1}^K \theta_{ng} \, f(y_{jrnm} \mid \eta_{gh}) \, \theta_{mh} \Big) D_\alpha(d\theta).$  (4.10)

The number of replications is not necessarily the same across different response graphs, i.e.,
$R = R_j$. Likewise, the block model can be response specific, i.e., $\eta = \eta_j$. More
variations along these lines are possible.

A graphical representation of models in this family is given in Figure 4.4. Full model
specifications immediately adapt to the different kinds of data, e.g., multiple data types through
the choice of f, or parametric or semi-parametric specifications of the prior on the number of
clusters through the choice of $D_\alpha$.

Figure 4.4: The graphical representation of stochastic block models of mixed membership using
plates. For clarity, only a few arrows out of the block models $\eta_{1:J}$ are shown; however, all
interactions $y_{jnm}$ depend on the corresponding block model.

Example 24 (Admixture of Latent Blocks). Airoldi et al. (2006c, 2007b) introduced the Admixture
of Latent Blocks model to analyze a collection of protein-protein interactions. This model is defined
by the simplest set of model specifications for a stochastic block model of mixed membership, and
it was used to analyze the most basic kind of relational data. Given a single undirected unipartite
graph with binary edges, the Admixture of Latent Blocks model recovers membership of nodes to
clusters (i.e., the mixed-membership vectors $\theta_{1:N}$) and cluster-to-cluster interaction
probabilities (i.e., the block model $\eta$), under the assumption that K non-observable clusters
exist. Using this model on protein-protein interaction data: sub-populations correspond to
non-observable “stable protein complexes”, indexed by k; nodes correspond to “proteins”, indexed by
n; there is only one response variable that encodes whether a pair of proteins interacts or not, so
that j is omitted;


Figure 4.5: The graphical representation of the admixture of latent blocks introduced by Airoldi et al. (2006c) using plates. Note that only a few arrows out of the block model $\eta$ have been drawn, for clarity; however, all the interactions $y_{nm}$ depend on it.




E.M.  AIROLDI 


there is only one replicate, since the interactions have been measured with an experimental procedure such as “Yeast Two Hybrid” under a single experimental condition. The model assumes that each interaction in the collection is either present or absent given the memberships to specific protein complexes of the pair of single proteins involved. That is, each protein participates in the various interactions as a member of possibly different protein complexes. In order to simplify the inference, an explicit pair of indicator variables $(z_{n \to m}, z_{n \leftarrow m})$ is introduced for each interaction in the observed collection, which indicates the protein complexes that the two proteins are members of as they interact. The function $f(y_{nm} \mid \eta_{gh}) = \Pr(y_{nm} = 1 \mid z_{n \to m} = g, z_{n \leftarrow m} = h) = \text{Bernoulli}(\eta_{gh})$, where $\eta_{gh}$ is the probability that a protein in complex $g$ interacts with a protein in complex $h$. The mixed-membership vectors $\theta_{1:N}$ encode the expected protein complex proportions. They are distributed according to $D_\alpha$, i.e., a Dirichlet distribution. Equation 4.8 is obtained by integrating out the protein complex indicator variables $(z_{n \to m}, z_{n \leftarrow m})$ at the interactions level: the latent indicators $z_{n \to m}$ are distributed according to a Multinomial$(1, \theta_n)$, whereas the latent indicators $z_{n \leftarrow m}$ are distributed according to a Multinomial$(1, \theta_m)$. A graphical representation of this specific model is given in Figure 4.5.
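The generative reading of Example 24 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the inference code used in the cited papers: the function names, the two-block `eta`, and the Dirichlet parameter are hypothetical, and the graph is made undirected by sampling each unordered pair once. The second function computes the edge probability obtained by summing out the indicators, which is the inner sum over $g, h$ in the likelihood.

```python
import numpy as np


def sample_admixture_of_latent_blocks(n_nodes, alpha, eta, rng):
    """Sample one undirected binary graph following the steps of Example 24.

    alpha : (K,) Dirichlet hyper-parameter
    eta   : (K, K) block model; eta[g, h] is the probability that a node
            acting as a member of block g interacts with one acting in block h
    """
    n_blocks = len(alpha)
    # Mixed-membership vectors theta_n ~ Dirichlet(alpha), one row per node.
    theta = rng.dirichlet(alpha, size=n_nodes)
    y = np.zeros((n_nodes, n_nodes), dtype=int)
    for n in range(n_nodes):
        for m in range(n + 1, n_nodes):
            # Block memberships taken on by n and m for this one interaction.
            g = rng.choice(n_blocks, p=theta[n])
            h = rng.choice(n_blocks, p=theta[m])
            # Assumption: one draw per unordered pair, mirrored for symmetry.
            y[n, m] = y[m, n] = rng.binomial(1, eta[g, h])
    return theta, y


def edge_probabilities(theta, eta):
    """Pr(y_nm = 1 | theta, eta) with the indicators summed out:
    sum over g, h of theta[n, g] * eta[g, h] * theta[m, h]."""
    return theta @ eta @ theta.T


rng = np.random.default_rng(0)
eta = np.array([[0.9, 0.05], [0.05, 0.8]])  # hypothetical block model, K = 2
theta, y = sample_admixture_of_latent_blocks(30, np.array([0.2, 0.2]), eta, rng)
p = edge_probabilities(theta, eta)
```

With a small Dirichlet parameter most simulated nodes sit close to a single block, so `p` is close to the entries of `eta` for most pairs; nodes with genuinely mixed membership get intermediate probabilities.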


4.3  Strategies  for  Integrating  Complex  Data 


Integration of the measurements on relations and attributes involving objects of different types can take many forms. For the purposes of this thesis, it will suffice to distinguish two types of integration: one relates to descriptive versus predictive analyses, and the other relates to the integration of labels.

4.3.1  Descriptive Analyses

In a descriptive analysis, non-observables always contribute equally to the data generation, and, in turn, observables always inform equally the inference process about non-observables. This is what happens, for example, to the multivariate relations and multivariate attributes in the previous section: at the sampling scheme level, relations and attributes of different types are assumed to be independent.

A layer of complication may be introduced. Consider a data set with $N$ objects and, for simplicity, assume a fixed number of latent patterns, $K$. Consider measurements on $J$ attributes for each object.4 The mixed membership vectors in the model are object-specific, $\pi_{1:N}$; should they be attribute specific as well? In the general formulation in Section 4.2.1 the answer is no, but it need not be so. Introducing mixed membership vectors that are specific to object-attribute pairs allows for more flexibility in the description of the data. However, the description inferred from the data may not be optimal when the goal of the analysis is to predict one attribute given the rest (Barnard et al., 2003).


4.3.2  Predictive  Analyses 

In a predictive analysis, one set of non-observables always contributes to the data generation conditionally on the values assumed by a second set of observables, and, in turn, the two sets of observables inform the inference process about non-observables unequally. Namely, the information the latter set contributes to the inference process is used to describe residual variability, which cannot be explained by information contributed by the former set of observables. This is what happens to the labels $Z$ in Example 11.

4 The discussion applies to a set of J relations, unchanged.

A Lengthy Example. Consider observations consisting of $T$ sets of edges, $Y_{1:T}$, among a common set of nodes, $\mathcal{N}$. The data generating process is as follows.

1. For each node $n = 1, \dots, N$

   1.1. Sample the mixed-membership vector $\pi_n \sim \text{Dirichlet}(\alpha)$

   1.2. Sample the component indicator $z_n \sim \text{Multinomial}_K(\pi_n, 1)$

   1.3. Sample the latent representation $x_n \sim \prod_{k=1}^{K} \text{Gaussian}_2(\mu_k, \Sigma_k)^{z_{nk}}$

2. For each pair of nodes $(n, m) \in [1, N] \times [1, N]$

   2.1. Sample the value of interactions from a generalized linear model

        $y_{nm} \sim \text{Generalized Linear Model}(\text{link} = g^{-1})$

where the link function $g$ maps the support of the average response function, $\mu_{nm} = E[y_{nm}]$, onto $\mathbb{R}$, that is, the support of the linear model $\eta_{nm}$. The linear model $\eta_{nm} = \eta_{nm}(\beta, x_n, x_m)$ involves latent, node-specific covariates, $x_n \in \mathcal{X}$, and a global drift, $\beta$, shared by all nodes. A graphical representation of the DGP for the parametric case, using plates, is shown in Figure 4.6.

The data generating process posits that representations of nodes in a low dimensional latent space, $x_{1:N} \in \mathcal{X}$, are sampled independently for each graph, $G_t$, from a finite mixture of $K$ Gaussians with parameters $(\mu_{1:K}, \Sigma_{1:K})$, which encode the group centroids in the latent space $\mathcal{X}$ for all graphs.5 At the top of the hierarchy, the mixed membership vectors, $\pi_{1:N}$, are independent and identically distributed samples from a Dirichlet distribution over the $K$-dimensional simplex with hyper-parameter vector $\alpha$. They provide the mixture weights. The edge weights are then generated through a “generalized linear model” that makes use of the low dimensional, latent representations of nodes, $x_{1:N}$, as covariates, along with an extra parameter $\beta$. In particular, each edge weight, $y_{nm}$, may be generated starting from the relevant pair of node representations, $(x_n, x_m)$, through a distance model.

5 For example, if we take the low dimensional space $\mathcal{X}$ to be $\mathbb{R}^2$, then each one of the $K$ components of the mixture of Gaussians is a two-dimensional Gaussian.

Following the formalism in McCullagh and Nelder (1989) we specify the generalized linear model that generates the observed edge weights $y_{nm}$ (see step 2.1 of the data generating process) in terms of three elements.

1. The error model, $p(y_{nm})$, i.e., the model for the observed edge weights with mean $\mu_{nm} = E[y_{nm}]$.

2. The linear model, $\eta_{nm} = \eta_{nm}(\beta, x_n, x_m) = \eta_{nm}(\beta, d(x_n, x_m))$, for any explicit distance model $d$.

3. The link function6, $g(\mu_{nm}) = \eta_{nm}$, which maps the support of $\mu_{nm}$ to that of $\eta_{nm}$, typically $\mathbb{R}$.

Figure 4.6: The graphical representation of the parametric model using plates, for a set of $T$ matrices. Note that we did not draw all the arrows out of the global parameter, for clarity, since all the interactions $y_{nm}$ depend on it.

In particular, the linear model $\eta_{nm}$ includes an explicit distance model, $d$, in the latent space, $\mathcal{X}$. Using the models proposed in Hoff et al. (2002),

$$\eta_{nm} = \eta_{nm}(\beta, x_n, x_m) = \begin{cases} \beta - |x_n - x_m| & \text{distance model} \\ \beta + \dfrac{x_n^\top x_m}{|x_m|} & \text{projection model.} \end{cases} \qquad (4.11)$$


Intuitively, edges are more likely to be generated between pairs of nodes whose corresponding representations in the latent space are close.

Note 3. In a binary graph we can posit $p(y_{nm}) = \text{Bernoulli}(\mu_{nm})$, where $\mu_{nm} \in [0, 1]$ for all node pairs $(n, m) \in \mathcal{N} \times \mathcal{N}$. The linear model is $\eta_{nm} = \beta + d(x_n, x_m)$. The link function is $g(\mu_{nm}) = \log\left(\frac{\mu_{nm}}{1 - \mu_{nm}}\right)$, and its inverse is $\mu_{nm} = \frac{1}{1 + \exp(-\eta_{nm})}$. In a graph with non-negative, integer edge weights we can posit $p(y_{nm}) = \text{Poisson}(\mu_{nm})$, where $\mu_{nm} \in \mathbb{R}_+$ for all node pairs $(n, m) \in \mathcal{N} \times \mathcal{N}$. The linear model is $\eta_{nm} = \eta_{nm}(\beta, d(x_n, x_m))$, as in Equation 4.11. The link function is $g(\mu_{nm}) = \log(\mu_{nm})$, and its inverse is $\mu_{nm} = e^{\eta_{nm}}$.
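The data generating process and the two link functions of Note 3 can be put together in a short simulation. The sketch below is a minimal illustration under assumptions that are not in the text: the function name and arguments are hypothetical, the Gaussian components are taken to be isotropic (a standard deviation per group instead of a full covariance $\Sigma_k$), and the linear model uses the distance form of Equation 4.11, $\beta - |x_n - x_m|$.

```python
import numpy as np


def sample_latent_space_graph(n_nodes, alpha, mu, sigma, beta, rng,
                              family="bernoulli"):
    """Simulate edges from latent positions via a GLM with a distance model.

    mu    : (K, D) group centroids in the latent space
    sigma : (K,) per-group standard deviations (isotropic; an assumption)
    beta  : global drift shared by all node pairs
    """
    n_groups, dim = mu.shape
    pi = rng.dirichlet(alpha, size=n_nodes)                # step 1.1
    z = np.array([rng.choice(n_groups, p=p) for p in pi])  # step 1.2
    # Step 1.3: position drawn from the Gaussian of the sampled component.
    x = mu[z] + sigma[z][:, None] * rng.standard_normal((n_nodes, dim))
    # Pairwise Euclidean distances |x_n - x_m| in the latent space.
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    eta = beta - dist                                      # distance model
    if family == "bernoulli":
        mean = 1.0 / (1.0 + np.exp(-eta))                  # inverse logit link
        y = rng.binomial(1, mean)
    elif family == "poisson":
        mean = np.exp(eta)                                 # inverse log link
        y = rng.poisson(mean)
    else:
        raise ValueError(f"unknown family: {family}")
    return x, mean, y


rng = np.random.default_rng(1)
mu = np.array([[-2.0, 0.0], [2.0, 0.0]])  # hypothetical centroids in R^2
sigma = np.array([0.5, 0.5])
x, mean, y = sample_latent_space_graph(40, np.ones(2), mu, sigma, 1.0, rng)
```

Because step 2 of the process ranges over all ordered pairs, `y[n, m]` and `y[m, n]` are sampled separately here; the mean matrix is symmetric and largest on the diagonal, where the distance is zero.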


This model follows closely the models in Hoff et al. (2002) and Handcock et al. (2007), with the novelty that it depends on the set of mixed membership vectors $\pi_{1:N}$. In a sense, it is predictive because a model for the joint probability of latent variables and data is missing, $p(Y, \pi_{1:N}, x_{1:N})$. However, in order to make it predictive in the sense intended here a little more work is needed. First, we need to introduce a second source of information, for example, multivariate attributes on the nodes, $U_{1:N}$, where the quantity $u_n(m)$ encodes the value of the $m$-th attribute measured on the $n$-th node. Then we posit a data generating process for the attributes, e.g., along the lines of the models in Section 4.1. Finally, and here is where the term predictive is used in the sense I intend, we need to make a decision about how to link the two models, for interactions and attributes. If the goal of the analysis is that of predicting, or de-noising, interactions from attributes (Airoldi et al., 2006c) then we want to condition the interactions on the attributes in the generating process. There are several ways of doing this; a possibility is that of generating the node-specific mixture component indicator $z_n$, in the model for the interactions $Y$, from the node-specific mixture component indicators $\{z_n^{m} : m = 1, \dots, M\}$ already sampled for the attributes; see Barnard et al. (2003) for another example.

6 Here I consider canonical link functions (McCullagh and Nelder, 1989).

Going back to the multivariate models of attributes and relations of Section 4.2, I need to specify a generative link between the attributes and/or relations at the sampling scheme level; univariate measurements are no longer independent. Figure 4.7 shows the relevant portion of the graphical model structure that is common to models of both multivariate attributes and relations. By positing structural dependencies in the model, predictive analyses can be supported; that is, latent patterns associated with a set of measurements will be inferred that are useful in predicting a different set of measurements.


Modeling Text and References. I conclude this chapter with an application of data integration in a larger context: models of data integration are instrumental in resolving a substantive issue about model choice.

Figure 4.7: The structural dependencies among the (latent) variables in the dashed box distinguish the type of analysis. The absence of dependencies (top panel) leads to models that support descriptive analyses, whereas the presence of dependencies (bottom panel) leads to models that support predictive analyses.

Example 25 (Proceedings of the National Academy of Sciences, PNAS). Erosheva et al. (2004) and Griffiths and Steyvers (2004) report on their estimates about the number of latent topics, and find evidence that supports a small number of topics (e.g., as few as 8 but perhaps a few dozen) or as many as 300 latent topics, respectively. There are a number of differences between the two analyses: the collections of papers were only partially overlapping (both in time coverage and in subject matter), the authors structured their dictionary of words differently, and one model could be thought of as a special case of the other but the fitting and inference approaches had some distinct and non-overlapping features. The most remarkable and surprising difference comes in the estimates for the numbers of latent topics: Erosheva et al. focus on values like 8 and 10 but admit that a careful study would likely produce somewhat higher values, while Griffiths & Steyvers present analyses they claim support on the order of 300 topics! Should we want or believe that there are only a dozen or so topics capturing the breadth of papers in PNAS or is the number of topics so large that almost every paper can have its own topic? A touchstone comes from the journal itself. PNAS, in its information for authors (updated as recently as June 2002), states that it classifies publications in biological sciences according to 19 topics. When submitting manuscripts to PNAS, authors select a major and a minor category from a predefined list of 19 biological science topics (and possibly those from the physical and/or social sciences).

Below, I summarize an alternative set of analyses (Airoldi et al., 2006e) using the version of the PNAS data on biological science papers analyzed by Erosheva et al. (2004). These analyses employ both parametric and non-parametric strategies for model choice, and make use of both text and references of the papers in the collection, in order to resolve this issue. This case study gives us a basis to discuss and assess the merit of the various strategies. In the process I explore how to perform model selection for Bayesian models of mixed membership. After choosing an optimal value for the number of topics, $K^*$, and its associated word and reference usage patterns, I also examine the extent to which they correlate with the actual topic categories specified by the authors.


[Figure 4.8 image: two panels; horizontal axis: number of latent topics.]

Figure 4.8: Left Panel: Log-likelihood (5 fold cv) for $K = 5, \dots, 50, 75, 100, 200, 300$ topics. We plot: text only, $\alpha$ fitted (solid line); text only, $\alpha$ fixed (dashed line). Right Panel: Log-likelihood (5 fold cv) for $K = 5, \dots, 50, 100$ topics. We plot: text and references, $\alpha$ fitted (solid line); text and references, $\alpha$ fixed (dotted line).


Six Bayesian mixed membership models were fitted to infer the topics underlying the PNAS dataset: words alone or both words and references were modeled with parametric and semi-parametric mixed model specifications, and for fully parametric specifications the Dirichlet hyper-parameter $\alpha$ was either fitted using an empirical Bayes strategy or fixed with an ad-hoc strategy inspired by the one used in the analysis of PNAS data by Griffiths and Steyvers (2004). Full details about model specifications and posterior inference algorithms, using both variational methods and MCMC, are given in Airoldi et al. (2006e). See the right panel of Figure 4.3 for a graphical representation of the models of text and references.

Figure 4.9: Posterior distribution of $K$ for the PNAS scientific collection corresponding to the infinite mixture models of text (left panel) and of text and references (right panel).

The plots of the log likelihood in Figure 4.8 suggest a number of topics between 20 and 40 whether words or words and references are used. The semi-parametric model generates a posterior distribution for the number of topics, $K$, given the data. Figure 4.9 shows that the posterior distribution ranges from 23 to 33 profiles. We can expect that the semi-parametric model will require more topics than the parametric model, since it leads to a hard clustering of documents into topics. By choosing $K = 20$ topics, a meaningful interpretation of all of the word and reference usage patterns can be found. A parametric model with 20 topics was fitted to the data, both words and references, to focus on the interpretation of the 20 topics. Table 4.3 lists 12 high-probability words from the estimated 20 topics after filtering out the stop words. Table 4.4 shows the 5 references with the highest probability for 6 of the topics.

[Figure 4.10 image: heat-map panels with rows labeled by PNAS editorial category (Pharmacology, Physiology, Developmental Biology, Plant Biology, Microbiology, Evolution, Biophysics, Immunology, Genetics, Cell Biology, Neurobiology, Medical Sciences, Biochemistry) and columns indexed by latent topic (1 to 20). Shading legend, darkest to lightest: 55%-65%, 45%-55%, 35%-45%, 25%-35%, 15%-25%, 10%-15%, 10% or less.]

Figure 4.10: The average membership in the 20 latent topics (columns) for articles in thirteen of the PNAS editorial categories (rows). Darker shading indicates higher membership of articles submitted to a specific PNAS editorial category in the given latent topic and white space indicates average membership of less than 10%. Note that the rows sum to 100% and therefore darker topics show concentration of membership and imply sparser membership in the remaining topics. These 20 latent topics were created using the four finite mixture models with words only (1st, 2nd) or words and references (3rd, 4th) and $\alpha$ estimated (1st, 3rd) or fixed (2nd, 4th).

Table 4.3: Word usage patterns corresponding to the model of text & references, with K = 20 topics.

Topic 1          Topic 2          Topic 3          Topic 4          Topic 5
gene             kinase           cells            cortex           species
genes            activation       virus            brain            evolution
sequence         receptor         gene             visual           population
chromosome       protein          expression       neurons          populations
analysis         signaling        human            memory           genetic
genome           alpha            viral            activity         selection
sequences        phosphorylation  infection        cortical         data
expression       beta             cell             learning         different
human            activated        infected         functional       evolutionary
dna              tyrosine         vector           retinal          number
number           activity         protein          response         variation
identified       signal           vectors          results          phylogenetic

Topic 6          Topic 7          Topic 8          Topic 9          Topic 10
enzyme           plants           protein          protein          cells
reaction         plant            rna              model            cell
ph               acid             proteins         folding          tumor
activity         gene             yeast            state            apoptosis
site             expression       mrna             energy           cancer
transfer         arabidopsis      activity         time             p53
mu               activity         trna             structure        growth
state            levels           translation      single           human
rate             cox              vitro            molecules        tumors
active           mutant           splicing         fluorescence     death
oxygen           light            complex          force            induced
electron         biosynthesis     gene             cdata            expression

Topic 11         Topic 12         Topic 13         Topic 14         Topic 15
transcription    dna              cells            protein          ca2+
gene             rna              cell             membrane         channel
expression       repair           expression       proteins         channels
promoter         strand           development      atp              receptor
binding          base             expressed        complex          alpha
beta             polymerase       gene             binding          cells
transcriptional  recombination    differentiation  cell             neurons
factor           replication      growth           actin            receptors
protein          single           embryonic        beta             synaptic
dna              site             genes            transport        calcium
genes            stranded         drosophila       cells            release
activation       cdata            embryos          nuclear          cell

Topic 16         Topic 17         Topic 18         Topic 19         Topic 20
peptide          cells            domain           mice             beta
binding          cell             protein          type             levels
peptides         il               binding          wild             increased
protein          hiv              terminal         mutant           insulin
amino            antigen          structure        gene             receptor
site             immune           proteins         deficient        expression
acid             specific         domains          alpha            induced
proteins         gamma            residues         normal           mice
affinity         cd4              amino            mutation         rats
specific         class            beta             mutations        treatment
activity         mice             sequence         mouse            brain
active           response         region           transgenic       effects

Table 4.4: References usage patterns for 6 of the 20 topics corresponding to the model of text & references, with K = 20 topics.

Author           Journal

Topic 2
THOMPSON,CB      SCIENCE, 1995
XIA,ZG           SCIENCE, 1995
DARNELL,JE       SCIENCE, 1994
ZOU,H            CELL, 1997
MUZIO,M          CELL, 1996

Topic 5
SAMBROOK,J       MOL. CLONING. LAB. MANU., 1989
ALTSCHUL,SF      J. MOL. BIOL., 1990
EISEN,MB         P. NATL. ACAD. SCI. USA, 1998
ALTSCHUL,SF      NUCLEIC. ACIDS. RES, 1997
THOMPSON,JD      NUCLEIC. ACIDS. RES, 1994

Topic 7
SAMBROOK,J       MOL. CLONING. LAB. MANU, 1989
THOMPSON,JD      NUCLEIC. ACIDS. RES, 1994
ALTSCHUL,SF      J. MOL. BIOL, 1990
SAITOU,N         MOL. BIOL. EVOL, 1987
ALTSCHUL,SF      NUCLEIC. ACIDS. RES, 1997

Topic 8
SAMBROOK,J       MOL. CLONING. LAB. MANU, 1989
KIM,NW           SCIENCE, 1994
BODNAR,AG        SCIENCE, 1998
BRADFORD,MM      ANAL. BIOCHEM., 1976
FISCHER,U        CELL, 1995

Topic 17
SHERRINGTON,R    NATURE, 1995
HO,DD            NATURE, 1995
SCHEUNER,D       NAT. MED., 1996
THINAKARAN,G     NEURON, 1996
WEI,X            NATURE, 1995

Topic 20
CHOMCZYNSKI,P    ANAL. BIOCHEM., 1987
BRADFORD,MM      ANAL. BIOCHEM., 1976
KUIPER,GGJM      P. NATL. ACAD. SCI. USA, 1996
MONCADA,S        PHARMACOL. REV., 1991
KUIPER,GG        ENDOCRINOLOGY, 1998

Using both tables, here is a possible interpretation of the 20 topics:

•  Topics 1 and 12 focus on nuclear activity: genetic (topic 1) and repair/replication (topic 12).

•  Topic 2 concerns protein regulation and signal transduction.

•  Two topics are associated with the study of HIV and immune responses: topic 3 is related to virus treatment and topic 17 concerns HIV progression.

•  Two topics relate to the study of the brain and neurons: topic 4 (behavioral) and topic 15 (electrical excitability of neuronal membranes).

•  Topic 5 is about population genetics and phylogenetics.

•  Topic 7 is related to plant biology.

•  Two topics deal with human medicine: topic 10 with cancer and topic 20 with diabetes and heart disease.

•  Topic 13 relates to developmental biology.

•  Topic 14 concerns cell biology.

•  Topic 19 focuses on experiments on transgenic or inbred mutant mice.

•  Several topics are related to protein studies, e.g., topic 9 (protein structure and folding), topic 11 (protein regulation by transcription binding factors), and topic 18 (protein conservation comparisons).

•  Topics 6, 8, and 16 relate to biochemistry.

These labels for the topics are primarily a convenience, but they do highlight some of the overlap between the PNAS sections (Plant Biology and Developmental Biology) and the latent topics (7 and 13). However, many plant biologists may do molecular biology in their current work. By examining the topics one can see that small sections such as Anthropology do not emerge as topics and broad sections such as Medical Science and Biochemistry have distinct subtopics within them. This also suggests special treatment for general sections such as Applied Biology and cutting-edge interdisciplinary papers when evaluating the classification effectiveness of a model.

A graphical representation of the distribution of latent topics for each of the PNAS topics is provided in Figure 4.10. The third panel represents the model used for Tables 4.3 and 4.4. The two panels on the right represent models where the $\alpha$ parameter of the Dirichlet prior over topics is fixed. These two models are less sparse than the corresponding models with $\alpha$ fit to the data. For twenty latent topics, the hyper-parameter $\alpha$ was fixed at $50/20 = 2.5 > 1$ and this means each latent topic is expected to be present in each document and a priori we expect equal membership in each topic. By contrast, the fitted values of $\alpha$ are less than one and lead to models that expect articles to have high membership in a small number of topics. The PNAS topics tend to have a few latent topics highly represented when $\alpha$ is fit and low to moderate representation in all topics when $\alpha$ is fixed (as seen by white/light colored rows). For additional discussion of further consequences of these assumptions see the simulation at the end of Section 6.2.2.
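The effect of the Dirichlet hyper-parameter on sparsity can be checked with a small simulation. The sketch below assumes a symmetric Dirichlet over $K = 20$ topics and compares the fixed value $50/K = 2.5$ with an illustrative value below one; the value 0.1 and the helper name are hypothetical choices, not the fitted values from the analysis.

```python
import numpy as np


def mean_largest_membership(alpha, n_topics, rng, n_draws=5000):
    """Average, over prior draws, of the largest entry of a membership vector
    sampled from a symmetric Dirichlet with parameter alpha."""
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_draws)
    return theta.max(axis=1).mean()


rng = np.random.default_rng(0)
K = 20
fixed = mean_largest_membership(50.0 / K, K, rng)  # alpha = 2.5 > 1
sparse = mean_largest_membership(0.1, K, rng)      # illustrative alpha < 1
print(f"alpha = 2.5: mean largest membership {fixed:.2f}")
print(f"alpha = 0.1: mean largest membership {sparse:.2f}")
```

With $\alpha > 1$ the prior mass concentrates near the center of the simplex, so the largest membership stays close to $1/K$; with $\alpha < 1$ a single topic typically takes a large share, which is the sparsity pattern described above.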

Further examining Figure 4.10, note that topic 1, identified with genetic activity in the nucleus, was highly represented in articles from Genetics, Evolution, and Microbiology. Also note that nearly all of the PNAS classifications are represented by several word and reference usage patterns in all of the models. This highlights the distinction between the PNAS topics and the discovered latent topics. The assigned topics used in PNAS follow the structure of the historical development of Biological Sciences and the divisions/departmental structures of many medical schools and universities. The latent topics, however, show the greater ideas of interest within the field. Topic 9, which concerns the structure and topology of proteins, is highly represented in theoretical papers in Evolution, Genetics, Cell and Developmental Biology as well as in applied papers in Ecology, Pharmacology, and Applied Biological Sciences. These latent topics, in other words, are structured around the current interests of Biological Sciences. Figure 4.10 also shows that there is a lot of hope for collaboration and interest between separate fields which are researching the same ideas.

The held-out log likelihood plot corresponding to five-fold cross validation in Figure 4.8 suggests a number between 20 and 40 topics for the parametric model. Further analyses with parametric mixed membership models of words and references support values towards the lower end of this range, i.e., $K = 20$, more than other choices. This is also true in the posterior distribution of $K$ for the semi-parametric mixed membership model. To conclude, the hyper-parameter $\alpha$ was fixed to $50/K$, following the choice in Griffiths and Steyvers (2004), as well as estimated using empirical Bayes. Both sets of analyses produced a similar conclusion. While Griffiths and Steyvers (2004) found posterior evidence for nearly 300 topics, a number on the order of 20 or 30 provides a far better fit to the data, assessed robustly by multiple criteria and model specifications that integrate different types of data. Moreover, a lower number of latent topics appears to be simpler and more interpretable in a meaningful way; this is not possible with 300 topics.
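The cross-validation strategy for choosing $K$ can be sketched as follows. This is not the mixed membership model or the inference algorithm of Airoldi et al. (2006e): as a stand-in it fits a plain mixture of multinomials by EM, which assigns each document to one component, and all function names and the smoothing constant are hypothetical. Only the selection loop (fit on four folds, score the held-out fold, sum over folds) mirrors the procedure described in the text. The multinomial coefficient is omitted from the log-likelihood because it does not depend on $K$.

```python
import numpy as np


def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def fit_multinomial_mixture(X, n_topics, rng, n_iter=50, smooth=0.1):
    """EM for a mixture of multinomials on a (documents x words) count matrix."""
    n_docs, n_words = X.shape
    log_w = np.full(n_topics, -np.log(n_topics))
    log_phi = np.log(rng.dirichlet(np.ones(n_words), size=n_topics))
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each document.
        log_r = log_w + X @ log_phi.T
        r = np.exp(log_r - logsumexp(log_r, axis=1)[:, None])
        # M-step: smoothed mixture weights and word distributions.
        log_w = np.log((r.sum(axis=0) + smooth) / (n_docs + n_topics * smooth))
        counts = r.T @ X + smooth
        log_phi = np.log(counts / counts.sum(axis=1, keepdims=True))
    return log_w, log_phi


def heldout_loglik(X, log_w, log_phi):
    """Log-likelihood of unseen documents, up to a constant free of K."""
    return logsumexp(log_w + X @ log_phi.T, axis=1).sum()


def cv_loglik(X, n_topics, rng, n_folds=5):
    """Sum of held-out log-likelihoods over the folds of a random partition."""
    idx = rng.permutation(len(X))
    total = 0.0
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        params = fit_multinomial_mixture(X[train], n_topics, rng)
        total += heldout_loglik(X[test], *params)
    return total
```

Given a count matrix `X` and a grid of candidate values, the selected number of components is the one with the largest cross-validated score, e.g. `max(grid, key=lambda k: cv_loglik(X, k, rng))`.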


*  *  * 

The generative models for attributes presented here differ much from published alternatives (e.g., Pritchard et al., 2000; Blei et al., 2003) in terms of the way the data inform the allocation of objects to patterns. For example, they are models of counts that lead to variability in the totals, and such variability has influence on the allocation; see Section 6.2.1 for a discussion. I derived a multivariate characterization of both models of attributes, in Section 4.1, and relations, in Section 3.1. Fast posterior inference is available for the general formulations as well (Airoldi et al., 2006d,e). I described alternative strategies for integrating multiple sources of data in such models, depending on whether the goal of the analysis is descriptive or predictive.

Integrating heterogeneous data types under a unified model is a challenge to the analysis of complex data, which are simultaneously described by intrinsically different types of characteristics, such as features in attribute space and links in relational space. This chapter suggests that Bayesian models of mixed membership provide us with a solution to modeling and algorithmic issues that arise in (what I term) integrated-learning problems that involve complex data by combining modules specific to multivariate attributes and relations within a hierarchical framework. Research along this line is still very limited, especially work based on well-founded statistical principles. My methodology supports robust inter-modal inference, latent mechanism discovery, and information retrieval. The strategies for integrating complex data presented here enable modular and distributable engineering solutions for organization and prediction problems. Feasible engineering approaches to such problems in essence require realistic statistical models, accompanied by scalable computational methods.

Chapter  5


Dynamics  and  Evolution 


In this chapter I describe how the models introduced in previous chapters can be extended to take temporal evolution into account. To this end, several models of dynamic behavior are present in the classical statistical literature, which can be used to model the evolution of latent patterns for a finite number of epochs, $T$. The basic idea is to choose which sets of variables evolve over time, e.g., $\theta_{1:T}$, and posit a model for the transition, e.g., $P_A(\theta_t \mid \theta_{t-1})$.

For instance, recall the state-space model of Example 10, which extends the factor analysis
model of Example 7 by evolving the latent factors from one epoch to the next. The data generating
process for the observations is as follows,

1. At epoch $t = 0$

   1.1. For each object $n \in \mathcal{N}$

      1.1.1. Sample the latent factors $\theta_n^{(0)} \sim \mathrm{Normal}_K(0, I)$

      1.1.2. Sample the error $\epsilon_n^{(0)} \sim \mathrm{Normal}_M(0, \Psi)$

      1.1.3. Define the multivariate attribute $x_n^{(0)} = A\,\theta_n^{(0)} + \epsilon_n^{(0)}$

2. At epoch $0 < t < T$

   2.1. For each object $n \in \mathcal{N}$

      2.2.1. Evolve the latent factors $\theta_n^{(t)} = F\,\theta_n^{(t-1)}$

      2.2.2. Sample the error $\epsilon_n^{(t)} \sim \mathrm{Normal}_M(0, \Psi)$

      2.2.3. Define the multivariate attribute $x_n^{(t)} = A\,\theta_n^{(t)} + \epsilon_n^{(t)}$,

where $F$ is a $(K \times K)$ matrix that encodes the dynamics of the latent factors. As before, the
algorithm suggests a hierarchical decomposition of the joint probability distribution of the attributes,
$X^{(1:T)} = (x^{(1)}_{1:N}, \ldots, x^{(T)}_{1:N})$, and the latent factors,
$\Theta^{(1:T)} = (\theta^{(1)}_{1:N}, \ldots, \theta^{(T)}_{1:N})$, given a set of underlying constants,
$\Lambda = (A, F, \Psi)$, that does not change over time.1 The likelihood is then,


(5.1) [display not recoverable from the source; a product over epochs starting at $t = 1$]


where $P_1$ and $P_2$ are $K$- and $M$-dimensional Gaussian densities, respectively, and $P_0$ is the
deterministic transformation in Step 2.2.1 of the data generating process. A graphical representation
of FA and SSM is given in Figure 5.1, which highlights the simple connection between the two
models.
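The data generating process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `matvec` and `simulate_ssm` are hypothetical, the error covariance is taken to be diagonal (passed as standard deviations), and epochs are indexed 0, ..., T-1 for a single object.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def simulate_ssm(A, F, error_sd, T, rng):
    """Simulate latent factors and attributes for one object over epochs 0..T-1.

    A        : M x K loading matrix
    F        : K x K dynamics matrix
    error_sd : M error standard deviations (a diagonal error covariance
               is an assumption of this sketch)
    """
    K = len(F)
    theta = [rng.gauss(0.0, 1.0) for _ in range(K)]        # Step 1.1.1
    thetas, xs = [], []
    for t in range(T):
        if t > 0:
            theta = matvec(F, theta)                       # Step 2.2.1 (deterministic)
        eps = [rng.gauss(0.0, sd) for sd in error_sd]      # Steps 1.1.2 and 2.2.2
        x = [mean + e for mean, e in zip(matvec(A, theta), eps)]  # Steps 1.1.3 and 2.2.3
        thetas.append(theta)
        xs.append(x)
    return thetas, xs
```

Calling `simulate_ssm(A, F, error_sd, T, random.Random(0))` once per object yields the full collection of attributes; only the initial factors and the errors are random, while the transition between epochs is deterministic.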

Model  specifications  vary  depending  on  the  applications,  e.g.,  ARIMA,  possibly  multivariate, 
linear  versus  non-linear,  deterministic  versus  stochastic  transition,  Gaussian  versus  non-Gaussian 
errors,  or  Markovian  versus  complex  (Wasserman,  1980;  Rabiner,  1989;  Brockwell  and  Davis, 
1991;  Karr,  1991;  West  and  Harrison,  1997;  Doucet  et  al.,  2001). 

Example 26. The admixture of latent blocks model of Section 3.1 is a model for a network $Y_t$.
Denote by $P(Y_t \mid \alpha, B)$ the model for the network at time $t$, given the hyper-parameter $\alpha$, which
governs the distribution of the mixed membership vectors $\pi_{1:N}$, and the stochastic block model $B$.

1 The dynamic matrix $F$ may be easily modeled as time dependent and/or stochastic, as the problem requires
(Airoldi and Faloutsos, 2004; Airoldi et al., 2005d).

Figure 5.1: Graphical representations of a factor analysis model (left) and of a state-space model
for observations at two consecutive epochs (right). White nodes denote non-observables, whereas
shadowed nodes denote observables.

The  model  can  be  extended  to  account  for  time  by  evolving  the  hyper-parameter  a  as  follows. 

1. At epoch $0 < t < T$

   1.1. Sample the error $\epsilon^{(t)} \sim \mathrm{Normal}_K(0, \sigma^2 I)$

   1.2. Evolve the latent position $\alpha^{(t)} = \alpha^{(t-1)} + \exp\{\epsilon^{(t)}\}$,

This extra step specifies a linear transition model, $P(\alpha^{(t)} \mid \alpha^{(t-1)})$. In the same spirit it is possible to
evolve the stochastic block model $B$.

Example 27. The latent space model of mixed membership introduced in Section 4.3.2 is a model
for a set of networks $Y_{1:T}$. Denote by $P(Y_t \mid \mu_{1:K}, \Sigma_{1:K})$ the model for the network at time $t$, given
a parametric description of the clusters in the latent space. This model can be extended to account
for time, at epoch $0 < t < T$, by evolving the latent cluster positions as follows.


1. For each cluster $k = 1, \ldots, K$

   1.1. Sample the error $\epsilon_k^{(t)} \sim \mathrm{Normal}_2(0, \Psi)$

   1.2. Evolve the latent position $\mu_k^{(t)} = F\,\mu_k^{(t-1)} + \epsilon_k^{(t)}$

This extra step in the process specifies a set of $K$ independent linear transition models,
$P_{\Lambda_k}(\mu_k^{(t)} \mid \mu_k^{(t-1)})$, where the parameter sets are shared, $\Lambda_k = \Lambda$, for all $k$.
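As an illustration of the transition step in Example 27, the following sketch advances a set of two-dimensional cluster positions by one epoch. The function name `evolve_clusters` is hypothetical, and the error covariance is assumed to be isotropic (a single standard deviation per coordinate), which is a simplification of a general covariance matrix.

```python
import random


def evolve_clusters(mu, F, noise_sd, rng):
    """Advance K cluster positions by one epoch: mu_k <- F mu_k + eps_k.

    mu       : list of K positions, each a list of 2 coordinates
    F        : 2 x 2 dynamics matrix shared by all clusters
    noise_sd : standard deviation of each error coordinate (an isotropic
               error covariance is an assumption of this sketch)
    """
    evolved = []
    for position in mu:
        # Linear part of the transition, applied to one cluster at a time.
        moved = [sum(f * c for f, c in zip(row, position)) for row in F]
        # Independent Gaussian error on each coordinate.
        evolved.append([c + rng.gauss(0.0, noise_sd) for c in moved])
    return evolved
```

Because each cluster is updated with its own error draw, the K transition models are independent given the shared dynamics matrix.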

Alternatively, it is possible to specify temporal patterns directly as a part of $(H, \Theta)$. Such
a modeling strategy allows one to consider longitudinal sequences of observations about objects as
admixtures of complicated patterns, specified in a parametric or non-parametric fashion, and avoids
technical issues that arise when considering the specification of an explicit model of evolution.


5.1  Dynamic  Network  Tomography 

The models discussed above resolve the observed sequence of graphs into simple patterns, which
evolve over time with some regularity. Independently of how such patterns are specified, their
description is parsimonious. In some problems, however, we need to solve the opposite problem;
namely, that of resolving the observed sequence of graphs into patterns with an order of complexity
higher than that of the observations. In other words, for latent patterns $\theta \in \mathcal{T}$ and observations
$y \in \mathcal{Y}$, in the models considered so far the dimensionality of $\mathcal{T}$ was lower than that of $\mathcal{Y}$. This is
no longer true in the models presented in this section, where the dimensionality of the space $\mathcal{T}$ is
higher than that of $\mathcal{Y}$. Problems of this sort, where the solution space is orders of magnitude larger
than the space spanned by the data and the constraints, are referred to as inverse problems in the
literature (Hansen, 1998).

The distinction above is not evident from the graphical representation of the models. The issues
are deeper: (i) identifying the space of solutions is often not trivial; (ii) regularization conditions
are needed to induce a well-behaved optimization problem. The driving application here is network
tomography, where the origin-destination (OD) traffic flows need to be estimated, e.g., who is
communicating with whom in a local area network. The direct measurement of the OD traffic is usually
difficult and typically infeasible; instead, the loads on every link can be easily measured, that is,
sums of desired OD flows. In a network with $N$ nodes, the problem is then to recover $O(N^2)$
OD flows from $O(N)$ sums. This problem has been studied by many in the statistical literature
(Vanderbei and Iannone, 1994; Vardi, 1996; Tebaldi and West, 1998; Cao et al., 2000; Zhang et al.,
2003; Airoldi and Faloutsos, 2004). The model proposed here starts from the Bayesian analysis of
Tebaldi and West (1998), and extends it to a dynamic context by: (i) introducing explicit time
dependence among the traffic flows; (ii) positing a stochastic multiplicative process for the dynamics;
and (iii) positing realistic, non-Gaussian marginals for the traffic flows. The findings echo those
of Tebaldi and West (1998) with regard to the need for informative priors in order to mitigate the
bias in the estimated traffic flows due to the presence of multiple peaks in the likelihood, and to the
presence of ridges in between those peaks, e.g., see Figure 5.8. The solution presented here scales
linearly with new observations and is more accurate than alternative solutions, on real network
traffic measured at Carnegie Mellon and at AT&T.


Figure 5.2: Estimation error in $\ell_2$ distance. Methods compared: Local Likelihood (Gaussian),
Linear SSM (Gaussian), IA Model Static (Gamma), IA Model Dynamic (Gamma), IA Model Static
(log-Normal), IA Model Dynamic (log-Normal).

Table 5.1: Summary of symbols.

Symbol          Description
$T$             Number of time points.
$\ell$          Number of observable link loads.
$k$             Number of non-observable OD flows.
$A$             $(\ell \times k)$ fixed routing matrix.
$Y$             $(\ell \times T)$ matrix of link loads.
$X$             $(k \times T)$ matrix of OD traffic flows.
$\Lambda$       $(k \times T)$ matrix of means of $X \mid \Lambda, \phi$.
$\phi$          $(1 \times T)$ vector of scale factors for $X \mid \Lambda, \phi$.
$\theta$        Generic vector of hyper-parameters.
$\pi(\theta)$   Generic prior distribution.


5.1.1  Goals  of  the  Analysis 

Knowledge about the origin-destination (OD) traffic matrix allows network engineers and managers
to solve problems in design, routing, configuration debugging, monitoring and pricing; in
fact the OD traffic matrix provides valuable information about who is communicating with whom
in a network, at any given time. Unfortunately the direct measurement of the OD traffic is usually
difficult, or even infeasible, in real networks. The direction of current research is to develop methods
to infer the OD traffic flows from observed traffic loads on the links of the network. However,
the methods that have been proposed so far do not seem to take full advantage of two of the main
empirically observed features of network traffic; namely its very skewed marginal distribution, and
its time dependent nature.

I introduce the inverse allocation model (IA henceforth), which improves on the models present in
the literature by introducing two realistic assumptions: (i) the log-Normal distribution provides a
realistic model for the marginal OD traffic flows, (ii) time dependence between successive flows
on the same OD route narrows the variability of the estimates. A two-stage estimation procedure is
proposed to estimate the parameters of the IA model.

The Problem and its Facets In a formulation of the problem we want to solve there are several
time series which we would like to estimate, but which we cannot observe, say, a vector of traffic
flows $x(t)$ over times $t = 1, \ldots, T$. However, we are able to observe linear combinations of these
traffic flows, the vector of link loads $y(t)$ over times $t = 1, \ldots, T$, and we know which components
of $x(t)$ mix into each of the components $y(i, t)$ at each time $t$ through the routing matrix $A$, which
does not change over time. There are two modeling aspects to this problem.

Problem 1 (Inverse Problem). Given the matrix of link loads $Y_{(\ell \times T)}$ and a routing matrix $A_{(\ell \times k)}$,
we want to find the matrix of non-observable OD traffic flows $X_{(k \times T)}$ such that $Y = A \cdot X$.
In all cases, $k > \ell$.


Example  28.  The  linear  equations  that  correspond  to  the  routing  scheme  of  the  star  network  in 
Figure  5.3  below  are: 


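\[
\begin{bmatrix} y(1,t) \\ y(2,t) \\ y(3,t) \end{bmatrix}
=
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} x(1,t) \\ x(2,t) \\ x(3,t) \\ x(4,t) \end{bmatrix}
\tag{5.2}
\]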


$y(1,t)$ measures the traffic load on the link from node 1 to the router and captures both the OD flow
from node 1 to node 2, $x(2,t)$, and the OD flow from node 1 to itself, $x(1,t)$. $y(3,t)$ measures the
traffic load on the link from the router to node 1 and captures both the OD flow from node 2 to node
1, $x(3,t)$, and the OD flow from node 1 to itself, $x(1,t)$. We want to estimate four ($k$) unobservable
quantities starting from three ($\ell$) independent observations.2 The system is under-specified, $k > \ell$,
hence some extra information is needed in order to identify one single solution.

Figure 5.3: Two subnetworks connect to a router. We observe the link loads (solid blue arrows),
and want to infer the hidden traffic flows (dashed red arrows).
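The under-specification in Example 28 can be checked numerically. In the sketch below, the first and third rows of the routing matrix follow the description in the text; the second row (the load on the link from node 2 to the router, taken as $x(3,t) + x(4,t)$) is an assumption of this illustration, as are the names `ROUTING` and `link_loads` and the numeric flow values.

```python
# Routing matrix of the star network: rows are link loads, columns are OD flows.
ROUTING = [
    [1, 1, 0, 0],  # y(1,t) = x(1,t) + x(2,t), as stated in the text
    [0, 0, 1, 1],  # y(2,t) = x(3,t) + x(4,t), assumed for this sketch
    [1, 0, 1, 0],  # y(3,t) = x(1,t) + x(3,t), as stated in the text
]


def link_loads(routing, flows):
    """Compute y = A x for one epoch."""
    return [sum(a * x for a, x in zip(row, flows)) for row in routing]


# Two different (made-up) sets of OD flows.
flows_a = [5, 3, 2, 4]
# Adding any multiple of (1, -1, -1, 1) leaves every link load unchanged,
# because that direction lies in the null space of the routing matrix.
flows_b = [x + d for x, d in zip(flows_a, [1, -1, -1, 1])]

loads_a = link_loads(ROUTING, flows_a)
loads_b = link_loads(ROUTING, flows_b)
```

Both `loads_a` and `loads_b` are identical even though the underlying flows differ, so the link loads alone cannot single out one solution.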

Problem 2 (Regularization). Impose a set of additional constraints (or a penalty term) on $X_{(k \times T)}$
in order to induce smoothness on the space of solutions of the inverse problem.

The  likelihood  of  the  data  entailed  by  a  statistical  model  provides  us  with  a  natural  criterion  to 
discern  likely  solutions  from  unreasonable  ones.  Following  this  idea  we  model  the  unobservable 
quantities  x(t)  with  a  joint  probability  distribution;  this  induces  a  probabilistic  mapping  on  the 
space of the observations $y(t)$ via equation 5.2, so that we can compute the likelihood of the
observations, and look for traffic flows that maximize the probability of particular data observations.
Unfortunately  in  time-independent  models  the  likelihood  of  y(t)  is  not  necessarily  unimodal,  even 
as  we  assume  independent  components  in  x(t),  and  even  as  we  use  well-behaved  functional  forms 
for  their  distributions.  More  information  is  needed  to  identify  a  solution.  At  this  point  there  are 
two  main  ways  to  introduce  the  extra  information  we  need.  In  a  purely  data-driven  approach  we 
would  augment  the  data  in  some  way,  whereas  in  a  knowledge-driven  approach  we  would  make 
use  of  informative  priors  in  a  Bayesian  setting,  with  the  complication  in  this  latter  case  of  defining 
what  we  mean  by  “informative”.  Data  augmentation  can  be  realized,  for  example,  by  raising  the 
likelihood  of  the  data  to  a  power,  as  in  simulated  annealing,  or  by  borrowing  observations  from 
epochs  close  in  time  to  the  current  one  to  obtain  a  smoothed  average  solution.  Alternatively,  we 
can  build  “informative”  priors  based  on  partial  knowledge  about  the  magnitude  of  the  OD  flows, 
and  update  using  Bayes  rule  and  a  “more  accurate”  data  model. 

The two-stage estimation procedure for the IA model is suggestive of a nonparametric empirical
Bayes learning strategy, where the observations are used to first calibrate informative priors, and
then to filter the posterior distributions of the OD flows given the data. The proposed solution:
(i) uses realistic models for the OD flows; (ii) takes advantage of the time dependence of the data
while using the whole history of observations $\{y(1), \ldots, y(t)\}$ to estimate $x(t)$ in a proper Bayesian
fashion.

2 We assume that routers neither generate nor absorb traffic.


5.1.2  Model  Specifications 

Previous models assume independent OD flows across different epochs. Here I introduce models
based on dynamical systems, which naturally extend previous approaches by assuming time
dependence explicitly (Brockwell and Davis, 1991; West and Harrison, 1997; Doucet et al., 2001).


Definition 1. A linear Gaussian state-space model is defined by the following set of equations,

\[
\begin{cases}
x(t) = F \cdot x(t-1) + e(t) \\
y(t) = A_{(\ell \times k)} \cdot x(t), \qquad t \ge 1
\end{cases}
\tag{5.3}
\]

where $\{e(t)\}$ is an i.i.d. Gaussian process with variance-covariance matrix $Q$, and $F$ is a known
matrix. Further, $x(0) \sim \mathrm{Normal}(m, V)$ and is independent of $e(t)$ for $t \ge 1$.


Classical state-space modeling strategies à la Box and Jenkins would look for the additional
constraints needed to solve Problem 2 in a known dynamical behavior suggested by some physical
law underlying the specific problem at hand (for example, the laws of motion in tracking the
trajectories of moving objects), in known seasonal patterns in the traffic, or in the presence
of strong cross-correlations among the OD flows. This knowledge would translate into constraints
on $F$ and $Q$ in system 5.3 above, and would serve the critical role of driving the inferences
towards one particular solution.

Augmented Gaussian State-Space Model The following Gaussian state-space model with drift
is used to obtain the preliminary estimates for the OD flows.

\[
\begin{cases}
x(t) = F \cdot x(t-1) + Q \cdot \mathbf{1} + e(t) \\
y(t) = A \cdot x(t) + \epsilon(t)
\end{cases}
\;=\;
\begin{cases}
\begin{bmatrix} x(t) \\ \mathbf{1} \end{bmatrix}
= \begin{bmatrix} F & Q \\ 0 & I \end{bmatrix}
\begin{bmatrix} x(t-1) \\ \mathbf{1} \end{bmatrix}
+ \begin{bmatrix} e(t) \\ 0 \end{bmatrix} \\
y(t) = [\, A \mid 0 \,] \begin{bmatrix} x(t) \\ \mathbf{1} \end{bmatrix} + \epsilon(t)
\end{cases}
\;=\;
\begin{cases}
\tilde{x}(t) = \tilde{F} \cdot \tilde{x}(t-1) + \tilde{e}(t) \\
y(t) = \tilde{A} \cdot \tilde{x}(t) + \epsilon(t)
\end{cases}
\tag{5.4}
\]

for $t > 1$, where $\mathbf{1} = (1, \ldots, 1)'$ is a constant vector of length $k$, the parameter $\phi(t)$ enters into
the variance-covariance matrix of $e(t) \sim N(0, \phi(t) \cdot Q^{\tau})$, $x(1) \sim N(0, V(1))$, $\epsilon(t) \sim N(0, R)$,
$x(1) \perp e(t)$ and $x(1) \perp \epsilon(t)$ for all $t > 1$, and finally $Q$ is a diagonal matrix with elements
$(q_1, \ldots, q_k)$, and $\tau$ is a known constant. In the model above, if we set $F = 0$ there is a one-to-one
mapping between $(q_1, \ldots, q_k, \phi(t))'$ and the unique elements in $E(y(t))$, $V(y(t))$. Further, it is
straightforward to verify that the following holds.

Note  4.  The  linear  Gaussian  state-space  model  in  equations  5.4  contains  the  model  in  Cao  et  al. 
(2000)  as  a  special  case.  Such  a  model  can  be  obtained  by  simply  setting  F  =  0,  hence  imposing 
independence  among  the  origin-destination  flows  x(t)  at  different  epochs. 
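The effect of setting $F = 0$ can be made explicit with a short worked derivation. It is a sketch under two reading assumptions: that $Q^{\tau}$ denotes the diagonal matrix with entries $q_i^{\tau}$, and that the second noise term in the observation equation, written $\epsilon(t)$ here, has covariance $R$ and is independent of the state noise $e(t)$.

```latex
\begin{align*}
F = 0 \;\Longrightarrow\; x(t) &= Q \cdot \mathbf{1} + e(t) = q + e(t),
  \qquad q = (q_1, \ldots, q_k)', \\
E(x(t)) &= q, \qquad V(x(t)) = \phi(t) \cdot Q^{\tau}
  = \phi(t) \cdot \mathrm{diag}(q_1^{\tau}, \ldots, q_k^{\tau}), \\
E(y(t)) &= A \, q, \\
V(y(t)) &= \phi(t) \cdot A \, Q^{\tau} A' + R .
\end{align*}
```

With $F = 0$ the state at time $t$ no longer depends on the state at time $t-1$, so the flows at different epochs are independent, and the first two moments of $y(t)$ are functions of $(q_1, \ldots, q_k, \phi(t))$ only.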


In the experiments on Carnegie Mellon origin-destination traffic, assuming a fixed relationship
between $x(i,t)$ and $x(i,t+1)$ is an unrealistic constraint. One possible solution is to assume
a relationship between the means of the OD flows $\lambda(i,t)$ and $\lambda(i,t+1)$ instead, and to
allow for some error. The SSM yields smooth estimates that capture information about this
relationship, which we pass to the next estimation stage. In fact, we introduce soft constraints
on the average process $\{\lambda(t), t \ge 1\}$ in the form of informative priors for the parameters
underlying its dynamical behavior. We reduce the number of parameters by merging dynamic and
error terms into a stochastic dynamical behavior. The marginal models for the OD traffic flows
are independent log-Normals.3 The main objects of interest are then the posterior distributions
$P(x(t) \mid y(1), \ldots, y(t))$. In particular, the point estimate for the OD traffic vector at time $t$ is given
by the mean $\hat{x}(t) = E(x(t) \mid y(1), \ldots, y(t))$.

Figure 5.4: Graphical representations: models with no explicit time dependence (left); linear
state-space models introduce an explicit dynamical behavior (center); the inverse allocation (IA) model
moves the explicit time dependence one layer up in the graphical model, thus allowing for the OD
flows to be more diverse (right).


Static Inverse Allocation Model The static version of the IA model considers independent problems
at each epoch. Briefly, we are interested in estimating $\hat{x}(t) = E(x(t) \mid y(t)) = E(x \mid y)$.


3 Airoldi (2003) also considers Gamma models.

To specify the full model at each time $t$ we write:

\[
\begin{cases}
x \mid \lambda, \phi \sim p(\lambda, \phi) \\
y = A x,
\end{cases}
\tag{5.5}
\]

where $p$ is log-Normal, parameterized so that $E(x(i) \mid \lambda, \phi) = \lambda(i)$, $V(x(i) \mid \lambda, \phi) = \phi \cdot \lambda(i)^{\tau}$,
$Cov(x(i), x(j) \mid \lambda, \phi) = 0$ for $i = 1, \ldots, k$ and $i \ne j$. Notice that $\phi$ is common across OD flows at
each epoch, and that $\tau$ is a known scalar, which we obtain by inspection of $Y$. The priors for the
$\lambda(i)$ are log-Normal$(\theta_1(i), \theta_2(i))$,4 for $i = 1, \ldots, k$ and independent for $i \ne j$. The prior for $\phi$ is
proportional to a constant, to $1/\phi$ or to $1/\phi^2$.
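The mean and variance parameterization of the log-Normal used above can be translated into the usual log-scale parameters by moment matching. The following is a minimal sketch; the function names `lognormal_params` and `lognormal_moments` are hypothetical, and the formulas are the standard log-Normal moment identities rather than anything specific to this model.

```python
import math


def lognormal_params(lam, phi, tau):
    """Return (mu, sigma2) of log x such that E(x) = lam and V(x) = phi * lam**tau."""
    variance = phi * lam ** tau
    # For a log-Normal, V(x) / E(x)^2 = exp(sigma2) - 1.
    sigma2 = math.log(1.0 + variance / lam ** 2)
    # For a log-Normal, E(x) = exp(mu + sigma2 / 2).
    mu = math.log(lam) - 0.5 * sigma2
    return mu, sigma2


def lognormal_moments(mu, sigma2):
    """Return (mean, variance) of a log-Normal with log-scale parameters (mu, sigma2)."""
    mean = math.exp(mu + 0.5 * sigma2)
    variance = (math.exp(sigma2) - 1.0) * mean ** 2
    return mean, variance
```

Applying `lognormal_moments` to the output of `lognormal_params(lam, phi, tau)` recovers the mean `lam` and the variance `phi * lam**tau`, which is a quick way to check a chosen parameterization.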

Dynamic Inverse Allocation Model This dynamic version of the IA model, which yields the
best results, implements the following Bayesian dynamical system:

\[
\begin{cases}
\lambda(i,t) = e(i,t) \cdot \lambda(i,t-1), \qquad i = 1, \ldots, k \\
x(t) \mid \lambda(t), \phi(t) \sim p(\lambda(t), \phi(t)) \\
y(t) = A \cdot x(t), \qquad t \ge 1,
\end{cases}
\tag{5.6}
\]

where $p$ is log-Normal, parametrized so that $E(x(i,t) \mid \lambda(t), \phi(t)) = \lambda(i,t)$,
$V(x(i,t) \mid \lambda(t), \phi(t)) = \phi(t) \cdot \lambda(i,t)^{\tau}$, and $Cov(x(i,t), x(j,t) \mid \lambda, \phi) = 0$ for $i = 1, \ldots, k$ and
$i \ne j$. Notice that $\phi(t)$ is common across OD flows at time $t$, and that $\tau$ is a known scalar, which
we obtain by inspection of $Y$. The priors for $\lambda(i,0)$ are log-Normal$(\theta(i,0), a)$,4 for $i = 1, \ldots, k$ and
independent for $i \ne j$, and for a big number $a$ that accounts for the uncertainty of the means of OD
flows at time zero. The prior for $\phi(t)$ is proportional to a constant, to $1/\phi(t)$ or to $1/\phi(t)^2$. The
priors for $e(i,t)$ are log-Normal$(\theta_1(i,t), \theta_2(i,t))$4 for $i = 1, \ldots, k$, and independent for $i \ne j$.
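A forward simulation can help build intuition for the dynamic system: the means evolve multiplicatively, the flows are drawn around the means, and the link loads are exact sums of flows. The sketch below makes several simplifying assumptions that are not in the text: the log-scale parameters of the multiplicative innovations (`theta1`, `theta2`, with `theta2` read as a variance) and the scale `phi` are held constant over routes and time, and the function names are hypothetical.

```python
import math
import random


def draw_lognormal(mean, variance, rng):
    """Draw from a log-Normal with the given mean and variance (moment matching)."""
    sigma2 = math.log(1.0 + variance / mean ** 2)
    mu = math.log(mean) - 0.5 * sigma2
    return rng.lognormvariate(mu, math.sqrt(sigma2))


def simulate_dynamic_ia(A, lam0, theta1, theta2, phi, tau, T, rng):
    """Simulate OD flows x(t) and link loads y(t) for t = 1..T."""
    lam = list(lam0)
    flows, loads = [], []
    for _ in range(T):
        # lambda(i,t) = e(i,t) * lambda(i,t-1), with log-Normal innovations e(i,t).
        lam = [l * rng.lognormvariate(theta1, math.sqrt(theta2)) for l in lam]
        # x(i,t) is log-Normal with mean lambda(i,t) and variance phi * lambda(i,t)**tau.
        x = [draw_lognormal(l, phi * l ** tau, rng) for l in lam]
        # y(t) = A x(t): link loads are deterministic sums of the flows.
        y = [sum(a * v for a, v in zip(row, x)) for row in A]
        flows.append(x)
        loads.append(y)
    return flows, loads
```

Simulated pairs of flows and loads of this kind can serve as a test bed in which the true flows are known.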

4 Airoldi (2003) also considers Gamma, Uniform, and truncated Gaussian priors.

Table 5.2: A summary of the models.

Model                    Time Dependence   Online Estimation   Skewed Marginals
Local likelihood         No                No                  No
Augmented Gaussian SSM   Yes               Yes                 No
IA (static)              No                No                  Yes
IA (dynamic)             Yes               Yes                 Yes


Informative Priors for $\lambda(t)$ The crucial question at this point is: how do we calibrate the
hyper-parameters underlying the prior distributions of $\lambda(t)$? First we obtain a preliminary set of estimates
$x(t)$ with the Gaussian linear SSM. Then, in the case of IA static, $(\theta_1, \theta_2)$ at each time are set so
that the mean and variance of $\lambda$ correspond to those of $x(t)$. Variances can be made much larger
without significant loss of precision. The intuition is that the preliminary estimates indicate
where the OD flows are on average. In the case of IA dynamic the intuition is the same; however, it
is not possible to set priors for $\lambda(t)$, as the sequence $\{\lambda(1), \ldots, \lambda(T)\}$ is going to be determined
by $\lambda(0)$ alone. The solution is then to extract from $\{x(1), \ldots, x(T)\}$ information about their local
dynamical behavior and use it to calibrate informative priors for $\{e(t), t \ge 1\}$. Technically, we
set $e(i,t)$ as independent log-Normals; we use the facts that the product convolution of log-Normals
is log-Normal (first line of system 5.6), and that log-Normal$(\theta_1(i,t), \theta_2(i,t)) = \exp\{N(\theta_1(i,t), \theta_2(i,t))\}$, to
solve the convolution problem exactly for $(\theta_1(i,t), \theta_2(i,t))$, for $i = 1, \ldots, k$. In other words, values
for $(\theta_1(t), \theta_2(t))$ are computed from $(x(t), x(t-1))$ at each time, and these parameters need not
be learned. $\theta(i,0)$ is set to be the average of the corresponding OD flow $\{x(i,t), t \ge 1\}$.
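The closure of log-Normals under products, on which the calibration relies, can be written out as a short derivation. It is a sketch under the assumption that the second parameter of each log-Normal is read as a variance on the log scale; the symbols $a$ and $b$ are introduced here only for illustration.

```latex
\begin{align*}
\lambda(i,t) = e(i,t) \cdot \lambda(i,t-1)
  \;&\Longleftrightarrow\;
  \log \lambda(i,t) = \log e(i,t) + \log \lambda(i,t-1), \\
\log \lambda(i,t-1) \sim N(a, b), \quad
\log e(i,t) \sim N(\theta_1(i,t), \theta_2(i,t)), \ \text{independent}
  \;&\Longrightarrow\;
  \log \lambda(i,t) \sim N\big(a + \theta_1(i,t),\; b + \theta_2(i,t)\big), \\
\log \lambda(i,t) \;&=\; \log \lambda(i,0) + \sum_{s=1}^{t} \log e(i,s).
\end{align*}
```

On the log scale the product becomes a sum of independent Gaussians, so the parameters of the innovation can be obtained by subtracting the log-scale parameters at time $t-1$ from those at time $t$, without any iterative fitting.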

Notice that every two-stage method that finds preliminary estimates and refines them uses
$\{x(1), \ldots, x(T)\}$ in the second stage, in some way. It is preferable to translate this information
into information about the means of the OD flows $\{\lambda(1), \ldots, \lambda(T)\}$, according to the intuition that
preliminary estimates can identify a smooth version of the OD flows we are looking for, which
makes a reasonable guess for their underlying average processes.

5.1.3 Estimation and Inference

The estimation strategy involves two stages. In the first stage we find preliminary, smooth estimates
for the OD flows, which make a good guess for the averages of the OD traffic. In the second stage
we refine these smooth estimates by looking for spikes and bursty periods with one single pass
over the data.

Figure 5.5: A non-parametric empirical Bayes approach to the filtering problem is at the core of
the inverse allocation dynamic model.

The IA dynamic model is a Bayesian dynamical system; EM and particle filters can be used
for estimation and inference. The implementation includes skewed models such as Gamma and
log-Normal, a wide selection of priors such as Uniform, Normal, Gamma and log-Normal, and several
resampling schemes to further validate the results on top of the main particle filter. Ghahramani
and Hinton (1996) show how to learn all the parameters in the linear Gaussian system 5.3, in
our case $F$, $Q$, $m$, and $V$, by means of the EM algorithm. Higuchi (2001) shows how a
self-organizing system can be built from non-linear non-Gaussian systems, so that all the relevant
parameters are learned during the filtering process. Gilks and Berzuini (2001) propose a particle
filter that keeps particles diverse. More specifically, we use the linear Gaussian SSM and related
EM steps proposed in Airoldi (2003), which includes the model in Cao et al. (2000) as a special
case, to obtain smooth estimates of the OD traffic, and we then use these estimates to calibrate
informative priors for the parameters underlying the dynamics of a non-Gaussian system, in
non-parametric empirical Bayes fashion. Eventually the particle filter makes good use of these priors
and of the skewed models, and finds a sequence of better posterior distributions for the traffic flow
on each OD route; we pick their means as point estimates.

In order to filter the posterior distributions of the origin-destination flows and estimate the
parameters of the models, I used a variation of the sample-resample-move algorithm of Gilks and
Berzuini (2001), briefly outlined below. For simplicity define $v(t)$ to be the vector of all parameters
in the model at time $t$, $v(t) := (x(t), \lambda(t), e(t), \phi(t))$, and $v(0) := (\lambda(0))$. The enhanced
particle filter algorithm is as follows. At $t = 0$ generate $N$ particles $\{\lambda_{(j)}(0)\}$ using $\theta(0)$, $a$.
Then iterate,

1. Set $t = t + 1$. Move each particle like so: (a) generate $e_{(i)}(t)$ using $(\theta_{1,(i)}(t), \theta_{2,(i)}(t))$,
and $\phi_{(i)}(t)$; (b) compute $\lambda_{(i)}(t)$ using the first line of system 5.6; (c) generate $x_{(i)}(t)$ by
sampling from the second line of system 5.6.

2. Resample $N$ new particles from $\{v_{(i)}(t)\}$ according to the likelihood, $P(y(t) \mid v_{(i)}(t))$,
they entail.

3. Move the new set of particles according to an MCMC for "several steps" to improve their
diversity. Go to 1.

For  details  about  the  MCMC  see  Airoldi  (2003). 
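The sample and resample steps listed above can be sketched as follows. This is not the implementation used for the experiments: the MCMC move step is omitted, the innovation parameters and the scale are held fixed, and, because the link loads are exact sums of flows, the likelihood is replaced by a Gaussian kernel on the residual $y(t) - A\,x(t)$ with a tolerance `tol`. The kernel, the tolerance, the particle layout (a dictionary with keys `"lam"` and `"x"`) and all function names are assumptions of this illustration.

```python
import math
import random


def draw_lognormal(mean, variance, rng):
    """Draw from a log-Normal with the given mean and variance (moment matching)."""
    sigma2 = math.log(1.0 + variance / mean ** 2)
    return rng.lognormvariate(math.log(mean) - 0.5 * sigma2, math.sqrt(sigma2))


def propagate(particles, theta1, theta2, phi, tau, rng):
    """Step 1: move each particle forward one epoch."""
    moved = []
    for p in particles:
        # (a)-(b): multiplicative log-Normal innovation applied to the means.
        lam = [l * rng.lognormvariate(theta1, math.sqrt(theta2)) for l in p["lam"]]
        # (c): draw the flows around the new means.
        x = [draw_lognormal(l, phi * l ** tau, rng) for l in lam]
        moved.append({"lam": lam, "x": x})
    return moved


def resample(particles, y, A, tol, rng):
    """Step 2: multinomial resampling with Gaussian-kernel weights on y - A x."""
    weights = []
    for p in particles:
        fitted = [sum(a * v for a, v in zip(row, p["x"])) for row in A]
        sq_dist = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))
        weights.append(math.exp(-0.5 * sq_dist / tol ** 2))
    if sum(weights) == 0.0:
        # All weights underflowed: fall back to uniform resampling.
        weights = [1.0] * len(particles)
    return rng.choices(particles, weights=weights, k=len(particles))


def posterior_mean(particles):
    """Point estimate of x(t): the average of the flows over the particles."""
    n = len(particles)
    k = len(particles[0]["x"])
    return [sum(p["x"][i] for p in particles) / n for i in range(k)]
```

One filtering iteration then consists of calling `propagate`, then `resample` with the observed link loads, and finally `posterior_mean` on the surviving particles; a move step would be inserted after resampling to restore particle diversity.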

Scalability and Irreducibility A recent result in network tomography (Cao et al., 2001) states
that it is possible to reformulate filtering problems corresponding to large networks as a sequence
of problems corresponding to small networks. As a consequence, the following result is true.

Lemma 1. The complexity of the learning algorithm for the dynamic IA model is $O(k \cdot T)$.

Proof. The result in Cao et al. (2001) implies that a tomography problem corresponding to a
network with $k$ origin-destination flows is equivalent to $O(k)$ tomography problems, which
correspond to disjoint subsets of, say, one to four OD traffic flows in the original problem. This fact,
along with the fact that our solution is linear in the number of time points for which the OD traffic
needs to be filtered, yields a total complexity of $O(k \cdot T)$ for the learning algorithm of the dynamic
IA model. □

Lemma 1 implies that if we solve the inverse problem for small networks, we immediately solve
it for networks of arbitrary size with comparable estimation errors. Further, the following result holds.

Lemma  2.  The  inference  strategy  is  based  on  an  irreducible  MCMC. 

Proof.  See  Appendix  A.  □ 

Lemma 2 implies that the proposed inference strategy is able to explore the support of the whole
joint posterior distribution of the OD flows. Note that, as hinted in the introduction to the problem,
this fact cannot be taken for granted in inverse problems; being able to identify and explore the
space of solutions is an issue that needs to be addressed, problem by problem. Furthermore, the
MCMC uses a Gibbs sampler with Metropolis steps.

Discussion  of  Experimental  Evidence  The  methods  were  tested  on  two  data  sets;  both  included 
validation  data. 

•  Carnegie Mellon traffic: the first data set, which we used to choose the appropriate model,
contained about 12100 origin-destination traffic flows measured every 5 minutes over slightly
less than two days at Carnegie Mellon University (CMU). We measured an average traffic of
14 GB every 5 minutes.

•  AT&T traffic: the second data set, which we used to test and compare the filtered traffic
obtained with different methodologies, contained 16 origin-destination flows measured every
5 minutes over a one-day period at AT&T, courtesy of Dr. Jin Cao at Bell Labs.

The analysis of Carnegie Mellon origin-destination traffic flows supports the hypothesis of a very
skewed distribution. In Figure 5.6 we plotted the logarithms of the observed flows versus the
logarithms of the number of times measurements of such a size appear (a log-log plot), after
discarding the measurements smaller than a standard packet (53 bytes = 424 bits). The log-log
plot indicates a log-Normal distribution may be appropriate. A histogram of the logarithms of the
flows indicated that a logarithmic transformation is actually too mild to remove all the skewness,
and a double logarithmic transformation would be more appropriate. The AT&T data set is much
smaller, and contains traffic flows generated on a smaller network; they are less skewed overall,
and a logarithmic transformation is enough to yield a symmetric histogram for the truncated flows.
The CMU data set was used to inform model development. The AT&T data set was then used as
an independent model validation data set.

The full story about the data sets is presented elsewhere (Airoldi, 2003; Airoldi and Faloutsos,
2004); here I will focus on findings that bear relevance to the methodological issues. In particular,
a few discussion points emerge that are shared by dynamic hierarchical models in applications to
inverse problems.

Figure 5.6: Log-log plots of the 12100 traffic flows measured at Carnegie Mellon (left panel) and
of the 16 traffic flows measured at AT&T (right panel).



1. Skewed Marginals: what is the impact of a skewed model on the accuracy of the estimates?
And what is the best model for the OD traffic?

2.  Time  Dependence:  what  is  the  impact  of  explicit  time  dependence  on  the  accuracy  of  the 
estimates? 

3.  Informative  Priors:  what  constraints  should  we  impose  to  solve  the  regularization  problem? 
How  do  they  impact  the  accuracy  of  our  estimates? 

The inferences obtained with different methods were compared by computing the distance between
the true OD flows in the validation set and the estimates. The best results were obtained
with a log-Normal distribution for the flows and vague Gaussian priors.
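
The comparison just described can be sketched as follows, assuming the distance is the Euclidean (ℓ2) distance and that improvements are reported as a percentage reduction relative to a baseline. The function names and the numbers are hypothetical and serve only to show the arithmetic; they are not taken from the validation set.

```python
import math


def l2_distance(truth, estimate):
    """Euclidean distance between true OD flows and their estimates."""
    if len(truth) != len(estimate):
        raise ValueError("truth and estimate must have the same length")
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)))


def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error improves on baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical flows at one time point (illustrative values only).
true_flows = [120.0, 30.0, 45.0, 300.0]
baseline_estimate = [100.0, 50.0, 45.0, 260.0]  # e.g., a local method
refined_estimate = [110.0, 40.0, 45.0, 280.0]   # e.g., a model-based method

baseline_error = l2_distance(true_flows, baseline_estimate)
refined_error = l2_distance(true_flows, refined_estimate)
reduction = relative_reduction(baseline_error, refined_error)
print(f"error reduced by {reduction:.1f}%")
```

In practice the errors would be averaged over all time points in the validation set before computing the reduction.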

To isolate the effect of realistic distributions for the OD flows, we compared the estimates
obtained with IA where no time dependence was assumed, for Gamma and log-Normal models,
and a variety of non-conjugate priors (Uniform, Gaussian, Gamma, log-Normal) and different
parametrizations, with the estimates obtained by local likelihood. Introducing a realistic model
reduced the error by between 25.4% and 40.8%. To isolate the effect of explicit time dependence,
we compared the estimates obtained with the augmented Gaussian model, which uses independent
AR(1) processes for the OD flows, with the estimates obtained with local likelihood. Introducing
time dependence reduced the error by 15.5% on average; the reduction ranged between 8.5% and
31.0%. Using the static IA model, uninformative priors yield flat or multi-modal posteriors in 60%
of the time points, whereas in the remaining 40% of the time points they yield wide
uni-modal posteriors. The main effect of the data y(t) on the posterior P(x(t) | y(t)) is on its
range; impossible configurations receive zero posterior probability. Informative priors with wide
variance all yield uni-modal distributions. The dynamic IA model with informative priors has
the advantage of requiring fewer particles than the version based on flat priors; knowing where to
sample may introduce bias, but the thick tails of the log-Normal distribution of both x(t) | λ(t), φ(t)
and θ2(t) mitigate the problem, and IA captures several of the hidden spikes.

[Figure 5.7, both panels. Legend: Local Likelihood (Gaussian); Linear SSM (Gaussian); IA Model Static; IA Model Static + Resampling; IA Model Dynamic.]

Figure 5.7: The bars represent the average estimation error in a validation set. Specifically, we plot
the ℓ2 distance between the true OD flows and the corresponding estimates obtained with the local
likelihood approach and the IA model in its various flavors. IA based on the Bayesian dynamical
system is a clear winner. In both panels we include the estimates obtained with the augmented
Gaussian state-space model. Error bars in the left panel correspond to IA models based on Gamma,
whereas error bars in the right panel correspond to IA models based on log-Normal.

[Figure 5.8 panels: P( x(3,244) | y(244) ) and P( x(4,244) | y(244) ).]

Figure 5.8: Example posterior distributions for the OD flows x(3, 244) and x(4, 244). The traffic on
the X axes is measured in Kbytes, and the figures show the posterior distributions we obtained with
non-informative priors (top panel) and with informative priors (bottom panel) calibrated using our
Gaussian linear SSM. The solid triangles represent the true hidden OD flows, whereas our point
estimates would be the means of the posterior distributions. Making the posteriors more unimodal
improves the estimates by reducing the bias entailed by extra modes.
[Figure 5.9 panels: P( x(2,255) | y(255) ) (left) and P( x(2,255) | y(1), ..., y(255) ) (right).]

Figure 5.9: Example posterior distributions for the OD flow x(2, 255). The traffic on the X axes is
measured in Kbytes, and the figures show the posterior distribution we obtained with IA static (left
panel) versus the one we obtained with IA dynamic (right panel). The solid triangles represent the
true hidden OD flow, whereas the empty triangles are our point estimates, which correspond to
the means of the posterior distributions. Making use of all the observations { y(1), ..., y(255) } in
computing the posterior distribution in the right panel reduced its variability (notice the different
ranges), thus improving the inferences.
Figure 5.10: The learning algorithm for IA models scales linearly with the problem size (number
of time ticks).

Briefly,  we  recover  a  smooth  version  of  the  OD  flows,  we  calibrate  informative  priors  for  some 
crucial  parameters,  and  eventually  we  use  a  dynamical  Bayesian  system  to  refine  the  estimates  and 
capture bursty traffic. This methodology allows us to combine the three simple ideas above: a
realistic model for the data, the use of a filtering scheme that takes advantage of time, and probabilistic
constraints to overcome the under-determinacy of the problem. In the first stage we use the Gaussian
linear SSM proposed in Airoldi (2003), and we calibrate informative priors for λ(t) using
these estimates. These priors incorporate information about the magnitude and the dynamical
behavior of the first-stage smooth estimates, and softly constrain the location of the average processes
{ λ(t), t ≥ 1 }. Other methods proposed in the literature make use of preliminary estimates, but
they only retain the information about the magnitude of the OD flows given by such estimates
in the refining stage; see for example Zhang et al. (2003), who use shrinkage to improve the
solutions given by a gravity model. In our method, the fact that we also retain the information
about the local dynamical behavior yields a significant jump in the final accuracy. Another channel
through which informative priors help achieve better accuracy is by reducing the bias entailed
by multiple modes in the posterior distributions. Making the posteriors more uni-modal improves
the precision of the point estimates of the OD flows (the posterior means), as we show in Figure
5.8. Informative priors do drive the inferences about the OD flows towards the preliminary
guesses; however, the two layers of our model and the use of soft probabilistic constraints entail
enough flexibility to capture several of the spikes in many cases; for an example see Figure 5.11.
Further, our first-stage estimates are safely based on a model that entails a one-to-one
relationship between OD flows and measurements, as it includes the model by Cao et al. (2000) as
a special case. In the second stage the primary object of interest becomes the sequence of posterior
distributions P( x(t) | y(1), ..., y(t) ). We use their means x̂(t) = E( x(t) | y(1), ..., y(t) ) as point
estimates for the OD flows at time t. The Bayesian dynamical system brings further improvements,
as we show in Figure 5.7, because we make use of all the observations up to time t in
computing the posterior distributions P( x(t) | y(1), ..., y(t) ); conditioning on more observations
yields a narrower variability. Local methods instead use fewer observations in a short window
around t.

Figure 5.11: Example fits: actual latent flows (solid black lines) versus reconstructed flows (dashed
red lines). IA manages to reconstruct several spikes.
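
To make the role of the filtering means E( x(t) | y(1), ..., y(t) ) concrete, here is a minimal sketch of a generic bootstrap particle filter. It is not the IA model: it assumes a hypothetical scalar AR(1) state observed with Gaussian noise, chosen only because it is the simplest setting in which the propagate, weight, and resample steps can be shown. All function names and parameter values are illustrative assumptions.

```python
import math
import random


def simulate(num_steps, phi, state_sd, obs_sd, rng):
    """Simulate the toy model: x(t) = phi*x(t-1) + noise, y(t) = x(t) + noise."""
    x = rng.gauss(0.0, state_sd / math.sqrt(1.0 - phi ** 2))
    states, observations = [], []
    for _ in range(num_steps):
        states.append(x)
        observations.append(x + rng.gauss(0.0, obs_sd))
        x = phi * x + rng.gauss(0.0, state_sd)
    return states, observations


def bootstrap_particle_filter(observations, num_particles, phi, state_sd, obs_sd, rng):
    """Return the filtering means E[x(t) | y(1), ..., y(t)] for the toy model."""
    # Initial particles come from the stationary distribution of the AR(1) state.
    stationary_sd = state_sd / math.sqrt(1.0 - phi ** 2)
    particles = [rng.gauss(0.0, stationary_sd) for _ in range(num_particles)]
    means = []
    for t, y in enumerate(observations):
        if t > 0:
            # Propagate each particle through the state transition.
            particles = [phi * x + rng.gauss(0.0, state_sd) for x in particles]
        # Weight by the Gaussian observation likelihood (shifted for stability).
        log_w = [-0.5 * ((y - x) / obs_sd) ** 2 for x in particles]
        max_log_w = max(log_w)
        weights = [math.exp(lw - max_log_w) for lw in log_w]
        total = sum(weights)
        means.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # Multinomial resampling to avoid weight degeneracy.
        particles = rng.choices(particles, weights=weights, k=num_particles)
    return means


def rmse(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))


rng = random.Random(0)
states, observations = simulate(200, 0.9, 0.5, 1.0, rng)
means = bootstrap_particle_filter(observations, 500, 0.9, 0.5, 1.0, rng)
print("RMSE of raw observations:", rmse(observations, states))
print("RMSE of filtering means: ", rmse(means, states))
```

In this toy setting the filtering means track the latent state more closely than the raw observations do, which mirrors the narrowing of the posterior obtained by conditioning on the full history rather than on a single time point.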

Concluding, experimental evidence shows that the improvement IA models achieve goes beyond
the contribution of state-of-the-art methods even when combined with recent resampling
schemes which improve any given set of estimates. The modeling choices behind IA models are
intuitive; first-stage estimates capture smooth average processes, second-stage estimates capture
the spikes. Last, the estimation strategy of the dynamic IA model provides some insight into how
to calibrate informative priors in Bayesian systems, where no clear guidance about the dynamics of
the latent variables is available.
5.2  Co-Evolving  Systems


The idea here is to revisit a classical model of social interactions and their evolution based on
constructional theory (Carley, 1990, 1991), and to explore whether, and to what extent, its specifications
fit within the statistical framework presented in the previous sections. In doing so, a few
points of discussion emerge that suggest a wide applicability of this approach.

The  basic  constructional  model  explains  the  dynamics  of  social  interactions  using  three  basic 
forces:  (i)  social  interactions  lead  to  shared  knowledge;  (ii)  similar  individuals  tend  to  interact,  and 
the  more  individuals  interact  the  more  similar  they  become;  (iii)  global  social  consensus  emerges 
from diverse local conditions. Elements of the model portray a simplified society with N individuals.
Culture is described in terms of individuals’ knowledge about K facts, at any given period
t, and encoded by Bernoulli variables f^t(n, k) specific to individual-fact pairs. Social structure is
defined in terms of individuals’ probabilities of interaction with one another at any given period t,
and encoded by scalars p^t(n, m) specific to pairs of individuals. Social structure is assumed to be
a deterministic function of the culture,

\[
p^t(n, m) = \frac{\sum_{k} f^t(n, k) \, f^t(m, k)}{\sum_{o} \sum_{k} f^t(n, k) \, f^t(o, k)}. \tag{5.7}
\]

Actual interactions occur at any given period t, and are denoted by i^t(n, m). Whenever two individuals
interact, each shares knowledge about a single fact k, chosen uniformly among those that
are known; this information is encoded by a pair of Bernoulli variables, u^t(n, k), u^t(m, l). And so
culture evolves, and social structure changes.

The  algorithm  that  specifies  the  evolution  of  social  structure  and  culture  in  this  model  is  as 
follows. 


1. At epoch t

1.1. Compute social structure P^t given culture F^t

1.2. Sample interactions I^t given social structure P^t

1.3. Sample knowledge exchanged U^t given interactions I^t and culture F^t

1.4. Update culture F^{t+1} given previous culture F^t and knowledge exchanged U^t

Figure 5.12: Graphical representation of the basic constructional model (Carley, 1990, 1991) and
one of its extensions (Shreiber, 2006).

The  left  panel  of  Figure  5.12  shows  how  the  basic  constructional  model  can  be  represented  in  the 
formalism  of  the  statistical  framework  presented  here. 
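
The four steps of the algorithm above can be sketched in a few lines of code. This is a simplified illustration under stated assumptions, not Carley's implementation: interaction probabilities are taken to be proportional to the number of shared facts, self-interaction is excluded, each individual picks exactly one partner per epoch, an individual with no shared facts picks a partner uniformly, and knowledge is never forgotten. The function names `social_structure` and `step` are hypothetical.

```python
import random


def social_structure(culture):
    """Step 1.1: interaction probabilities from shared knowledge (needs N >= 2)."""
    n_people = len(culture)
    probs = []
    for n in range(n_people):
        shared = [
            sum(a * b for a, b in zip(culture[n], culture[m])) if m != n else 0
            for m in range(n_people)
        ]
        total = sum(shared)
        if total == 0:
            # Assumption: with no shared facts, partners are equally likely.
            shared = [0 if m == n else 1 for m in range(n_people)]
            total = n_people - 1
        probs.append([s / total for s in shared])
    return probs


def step(culture, rng):
    """One epoch: steps 1.1 to 1.4 of the algorithm."""
    n_people = len(culture)
    probs = social_structure(culture)                       # 1.1
    partners = [rng.choices(range(n_people), weights=probs[n])[0]
                for n in range(n_people)]                   # 1.2
    exchanged = []                                          # 1.3
    for n, m in enumerate(partners):
        for speaker, listener in ((n, m), (m, n)):
            known = [k for k, bit in enumerate(culture[speaker]) if bit]
            if known:
                exchanged.append((listener, rng.choice(known)))
    new_culture = [row[:] for row in culture]               # 1.4
    for listener, fact in exchanged:
        new_culture[listener][fact] = 1
    return new_culture


rng = random.Random(0)
N, K = 6, 8
culture = [[1 if rng.random() < 0.3 else 0 for _ in range(K)] for _ in range(N)]
for epoch in range(20):
    culture = step(culture, rng)
print(sum(map(sum, culture)), "individual-fact pairs known out of", N * K)
```

Because culture feeds back into social structure at every epoch, repeated calls to `step` let one explore by simulation how shared knowledge and interaction patterns co-evolve.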

Remark 2 (On Inference). Recall that it is possible to characterize the models of graphs based on
the exponential family (Frank and Strauss, 1986; Wasserman and Pattison, 1996) with the formalism
of undirected graphical models (Airoldi, 2006; Hanneke and Xing, 2007). The inference for
the models of dynamics and evolution suggested by the constructional model of social interactions
is tractable, although possibly computationally expensive: as long as we make use of probability
distributions within the exponential family we can compute derivatives and likelihood and devise
the corresponding EM algorithm, using approximation strategies such as variational methods and
MCMC where necessary.

The  constructional  model  of  social  interactions  is  essentially  a  data  generating  process  that 
involves probabilistic events and regularities; as such, its specifications can be subsumed within
the statistical framework presented here. However, the goals of analyses that such a model (and
its extensions) allows are not restricted to inference and parameter estimation. Through simulation
experiments, for example, the constructional model allows one to explore the space of possibilities that
is consistent with a given set of structural hypotheses (Shreiber, 2006).

*  *  * 

In  this  chapter,  I  described  a  strategy  to  introduce  simple  dynamics  and  evolution  in  the  models 
of  complex  graphs  developed  so  far.  Edges  within  a  network  are  no  longer  exchangeable  in  this 
temporal  setting;  exchangeability  is  substituted  by  other  dependence  structures. 

I demonstrated how to fully specify a model of network evolution (the Inverse Allocation model
of Section 5.1) to solve an open problem in the context of dynamic network tomography. Network
tomography constitutes an interesting application where a locally smooth dynamic behavior
serves as the crucial constraint that allows the accurate estimation of origin-destination traffic from
few aggregate traffic measurements. A conditional marked point process accommodates the extra
variability due to bursts in the traffic.

I presented an overview of alternative (more complex) temporal modeling strategies and discussed
the extent to which they provide a conceptual bridge between statistical models and agent-based
models. I believe that this conceptual linkage suggests a new approach, rooted in Bayesian
statistics, to the calibration and validation issues that arise in agent-based models and simulations
in general.
Chapter  6


Concluding  Remarks 


This  thesis  provides  a  methodological  framework  for  the  statistical  analysis  of  complex  graphs  and 
dynamic  networks.  In  it,  I  developed  probabilistic  algorithms  that  generate,  evolve  and  integrate 
a  heterogeneous  collection  of  graphs,  I  studied  the  statistical  models  these  algorithms  implicitly 
specify,  and  I  developed  strategies  for  estimating  the  set  of  quantities  on  which  they  depend. 


6.1  Conclusions 

I have described a statistical approach to the analysis of complex systems. As has emerged from
the examples and case studies (either presented in detail or referred to in published work), most
of the models introduced here are tailored to the analysis of complex systems and their evolution,
with special emphasis on applications to social and biological networks. The goals of the analysis
in the various cases are different, but there is a binding theme: that of revealing non-observable
mechanisms underlying social and biological processes by integrating a heterogeneous collection
of measurements about diverse signals, i.e., networks, sequences, and attributes. Applications of
the models presented here in the context of biological systems will be the main focus of future
research.

From a methodological perspective I introduced: (i) models for the analysis of complex networks;
(ii) models for the analysis of multivariate attributes; (iii) strategies for integrating heterogeneous
measurements; and (iv) models for the evolution of the system, within a coherent
statistical framework. There are a few basic ideas that get combined in various guises to derive full
model specifications in this framework: (i) mixed membership; (ii) latent patterns; (iii) hierarchical
structure in the likelihood; (iv) dynamics; and (v) sparsity. I found these ideas to be useful in
applications to social and biological systems.

In  future  research,  I  plan  to  explore  fundamental  technical  issues  that  are  shared  by  Bayesian 
mixed  membership  models,  and  to  some  degree  by  hierarchical  Bayesian  mixture  models. 


6.2  Technical  Issues 

In  working  with  latent  aspects  models  of  the  sort  described  in  this  thesis,  I  have  encountered  four 
themes  of  a  technical  nature:  (i)  the  mixed  membership  of  objects  to  patterns,  and  the  related 
allocation  task;  (ii)  model  selection  and  model  choice;  (iii)  the  presence  of  many  local  peaks 
in  the  likelihood,  and  strategies  for  finding  one  with  a  good  substantive  interpretation;  and  (iv) 
scalability  of  the  approach  to  very  large  data  sets.  I  briefly  touch  upon  each  of  these  in  the  following 
subsections.  The  context  in  each  case  is  given  by  a  specific  model,  but  the  discussion  and  results 
generalize  to  other  models. 

6.2.1  The  Geometry  of  Allocation 

The  allocation  task  has  a  central  role  in  the  latent  aspects  models  described  in  this  thesis;  resolving 
this  task  is  equivalent  to  estimating  the  mixed  membership  map  between  objects  and  latent  patterns. 
Intuitively, we allocate objects to categories and we introduce a new category when the fit is bad
on  some  scale  using  the  current  number  of  categories.  It  is  possible  to  characterize  the  notion  of 
allocation  in  terms  of  variance  components,  both  analytically  and  with  simulations. 

Example 29. In the classical Factor Analysis and linear Gaussian state-space models it is possible
to derive in closed form the projections of data onto the lower dimensional spaces of factors and
states, respectively. The projection allocates data to latent components according to the entries of
the various variance-covariance matrices involved, assuming equal component weights. Consider,
for example, a factor analysis model: we measure D-dimensional quantities $Y = AX$, components
of which are assumed to be sums, through the matrix $A$, of K-dimensional latent factors,
$X$. In a simple formulation of the problem we can assume unit elements $A_{dk} = 1$, and $X \sim
\mathrm{Normal}(0, \Sigma)$. It turns out that the allocation of observations to factors, in this model, is resolved
by estimating latent factors with weighted averages of the observations, $E[\, X_k \mid \{Y(d)\}_{d=1}^{D} \,] =
\sum_{d} \omega^*_{kd} \, Y(d)$. The allocation is specified through the optimal (mixed-membership) weights, which
are functions of the elements of the variance-covariance matrix, $\omega^*_{kd} = \omega^*_{kd}(\Sigma)$. Note that the
variance-covariance matrix $\Sigma$ is estimated as well (e.g., see Appendix A.3 in Airoldi, 2003).
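
For readers who want to see where such weights come from, the standard Gaussian conditioning identity gives them in closed form. The display below is a sketch under an added assumption that is not part of the simple formulation above: the observations carry independent noise $\varepsilon \sim \mathrm{Normal}(0, \Psi)$ with $\Psi$ positive definite, which keeps the matrix being inverted non-singular. The symbols $\Psi$ and $W^*$ are introduced here for illustration only.

```latex
\[
Y = AX + \varepsilon, \qquad
X \sim \mathrm{Normal}(0, \Sigma), \qquad
\varepsilon \sim \mathrm{Normal}(0, \Psi)
\]
\[
\operatorname{Cov}(X, Y) = \Sigma A^{\top}, \qquad
\operatorname{Var}(Y) = A \Sigma A^{\top} + \Psi
\]
\[
E[\, X \mid Y \,]
  = \Sigma A^{\top} \left( A \Sigma A^{\top} + \Psi \right)^{-1} Y
  = W^{*} Y,
\qquad
\omega^{*}_{kd} = [\, W^{*} \,]_{kd}
\]
```

Each latent factor is thus estimated by a weighted sum of the D observed components, and the weights depend on the data only through the variance-covariance parameters.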

The seeming analytical intractability of these models presents us with some obstacles, and
opens analytical opportunities at the same time. Below I provide some experimental evidence that
is suggestive of how the quality of the allocation of objects to patterns responds to the quality of
the assumptions encoded by a model. In the future I plan to explore the extent to which a tractable
lower bound for the log-likelihood and asymptotic derivations help characterize these ideas.

Experimental Evidence: Simulations. The simulation takes place in the context of the models of
multivariate attributes I developed in Section 4.1, where we allocate genes to temporal expression
profiles using models that encode independence among occurrences of the same gene versus models
of contagion. Simulation experiments suggest that the Dirichlet-Poisson model of Section 4.1 is
better at recovering membership than the independence model when a realistic SAGE mean/variance
ratio holds.

We first validate our models by examining to what extent they can recover the mixed-membership
probabilities {θ_n}, i.e., the soft cluster assignments of each gene, under various simulated
conditions. We generated the ground truth using our generative processes, and we focused on scenarios
where the “mean” expression level at the various epochs was lower than its corresponding
“variance”, a realistic biological experimental scenario. We compare our models, normalized
DiP and conditional DiP, with two other methods, the independence model (Pritchard et al., 2000;
Minka and Lafferty, 2002; Blei et al., 2003), and the PoissonL model (Cai et al., 2004). Our models
yield higher likelihoods of expression profiles in the test set (not shown), and more accurate
predictions of the latent theme id of each gene based on its observed expression levels. Out of the
1000 genes we simulated, for example, nDiP and cDiP achieved 75.95% and 70.32% accuracy,
respectively, whereas the independence model reached only 63.25%. Strikingly, the independence
model clustered all genes in one profile in several runs.


Experimental Evidence: A 20-gene Synthetic Data Set. In small samples bearing realistic
SAGE characteristics, although the recovered clusters differ only slightly, the estimated mixed-membership
probabilities are sharper using DiP than with the independence model.

Here I report our analysis of a small dataset used in Cai et al. (2004), which contains the expression
profiles of 20 genes over 5 temporal epochs. Eighteen of the 20 genes belong to one of
4 clusters (temporal themes), and the remaining 2 are identified as outliers. The expression
profiles are generated from 6 different latent themes, or clusters, which the authors reduce to 4
by ignoring the abundance of the gene tags observed on the transcripts sampled at each epoch. In
particular, there are 3 profiles from theme 1, 4 from theme 2, 6 from theme 3, and 6 from theme 4.
The raw data is plotted in Figure 6.1 on various scales. Among the profiles from theme 2, there is
1 with 10 times as many gene tags as the others, and similarly for theme 3 (number 7 and number
13 in Figure 6.2). Note that these 2 profiles are “more expressed” but they follow an expression
theme similar to the other expression profiles in the respective clusters.

Figure 6.2 displays the 4 themes learned by the normalized and conditional DiP models (bottom-left
panel), versus those learned by PoissonL (Cai et al., 2004) and the independence model (top-left
panel). A rough eyeballing shows that the gene expression themes learned by DiPs and the two
competing methods are similar. However, a close examination reveals the following. Arguably,
we obtain a more compact theme 3, as revealed by the lower degree of dispersion among genes
assigned to this theme; but for theme 2, the genes assigned to it by the independence model and
PoissonL are slightly more consistent. Overall, the soft clustering assignments of each gene
are compatible across all 4 algorithms, as shown in Figure 6.2, but the mixed-membership
probabilities inferred by the DiPs for each gene are sharper. If we compare the MAP assignment
of each gene to a single most probable theme, 19 of the 20 genes are consistent across all 4
algorithms, and their assignments agree with the true theme labels given by the original dataset.
The remaining one, gene no. 10, is intriguing. It has an expression profile {Y_t}_{t=1}^{5} = (4, 10, 16, 14, 6),
and is originally labeled as from theme 2, {λ_t}_{t=1}^{5} = (10, 30, 30, 60, 10). Apparently profile {Y_t}_{t=1}^{5}
exhibits great variability with respect to its supposedly underlying theme. Using DiP, we infer the
label of gene no. 10 to be theme 3, which has a prototype profile {λ_t}_{t=1}^{5} = (10, 10, 10, 10, 10), and
indeed we found that much of the variability in gene 10 is related to the overall abundance of all genes
in different epochs, rather than its intrinsic trend. So we feel this assignment is arguably more
plausible than the purported theme 2. As shown in Figure 6.2, the independence model inferred a
split assignment, about equally probable between patterns 2 and 3.

Figure 6.1: The raw example data in Cai et al. (2004), on the original expression scale (left); on a
normalized expression scale, by gene, into [0, 1] (center); and on a normalized expression scale, by
epoch, using σ_{1:T} (right).

Figure 6.2: Left: Latent gene expression themes learned by different algorithms. Top: 4 themes
(numbered 1 to 4 from left to right) learned by PoissonL and the independence model. Each
theme is represented by the expression profiles of all the genes assigned to that theme based on
MAP prediction using the estimated mixed-membership vector θ_n. In this case, PoissonL and the
independence model give the same membership prediction. Bottom: The 4 themes discovered by
normalized DiP and conditional DiP. Note that due to overlap of the profile curves, the “occupancy”
number of each theme is not apparent here. But in Fig. 6.2, one can see it more clearly. Right:
The estimated membership probabilities, {θ_nk}, for the independence model (top), nDiP (middle),
and cDiP (bottom). Each row corresponds to a theme, and each column corresponds to a gene. The
color shades of the cells correspond to values ranging from 1 (black) to 0 (white). The panel shows
that cDiP yields the sharpest estimates.

The  example  suggests  the  role  of  model  properties  in  latent  allocation  tasks.  The  intuition  is 
that  if  the  model  cannot  express,  on  average,  the  salient  properties  of  the  data,  then  it  may  lead  to 
artifactual  effects.  Specifically,  the  unexplained  variability  will  need  to  find  a  “place-holder”,  and 
it  seems  to  increase  the  variability  of  parameter  estimates. 

6.2.2  Model  Selection  Strategies  and  Issues 

Although  there  are  pathological  examples,  where  slightly  different  model  specifications  lead  to 
quite  different  analyses  and  choices  of  key  parameters,  in  real  situations  we  expect  models  with 
similar  probabilistic  specifications  to  suggest  roughly  similar  choices  for  the  number  of  patterns 
(Airoldi  et  al.,  2006e).  In  the  applications  presented  or  referred  to  throughout  this  thesis  I  explored 
the  issue  of  model  choice  by  means  of  different  criteria. 

Parametric: Choice Informed by the Ability to Predict. Cross-validation is a popular method
to estimate the generalization error of a prediction rule (Hastie et al., 2001), and its advantages and
flaws have been addressed by many in that context (e.g., Ng, 1997). More recently, cross-validation
has been adopted to inform the choice about the number of groups and associated patterns in hierarchical
Bayesian models (Barnard et al., 2003; Wang et al., 2005). Guidelines for the proper use of
cross-validation in choosing the optimal number of groups K, however, have not been systematically
explored. One of the goals of our case studies is that of assessing to what extent cross-validation
can be “trusted” to estimate the underlying number of topics or disability profiles. In particular,
given the non-negligible influence of hyper-parameter estimates in the evaluation of the held-out
likelihood, i.e., the likelihood on the testing set, we discover that it is important not to bias the
analysis with “bad estimates” of such parameters, or with arbitrary choices that are not justifiable
using preliminary evidence, i.e., either in the form of prior knowledge, or outcome of the analysis
of training documents. To this extent, estimates with “good statistical properties,” e.g., empirical
Bayes or maximum likelihood estimates, should be preferred to others (Carlin and Louis, 2005).
Alternative approaches based on the predictive ability of a set of latent patterns have been recently
proposed, e.g., in the context of clustering (Tibshirani and Walther, 2005).

Semiparametric: Stochastic Process Priors. Positing a Dirichlet process prior on the number
of latent topics is equivalent to assuming that the number of latent topics grows with the log of the
number of, say, documents or individuals (Ferguson, 1973; Antoniak, 1974). This is an elegant
model selection strategy in that the selection problem becomes part of the model itself, although in
practical situations it is not always possible to justify. A nonparametric alternative to this strategy,
recently proposed (McAuliffe et al., 2006), uses the Dirichlet process prior, an infinite-dimensional
prior with a specific parametric form, as a way to mix over choices of K. This prior appears
reasonable, however, for static analyses of scientific publications that appear in a specific journal.
Kumar et al. (2000) specify toy models of evolution which justify the scale-free nature of the
relation between documents and topics using the Dirichlet process prior for exploratory data analysis
purposes (Kleinberg et al., 1999; Kumar et al., 2000). However, it has to be noted that the prior on
the membership of the patterns induced by many such processes is not always desirable, and in
certain applications is wrong. For example, in biological applications to protein interaction
networks, the latent patterns correspond to stable protein complexes (i.e., groups of proteins) that are
composed of 4 to 7 proteins on average (Krogan et al., 2006), which suggests that the number of
patterns grows roughly linearly, rather than logarithmically, with the number of proteins.
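The logarithmic growth mentioned above can be checked numerically. The sketch below is not part of the original analysis; it assumes the Chinese restaurant process representation of the Dirichlet process, in which the (i+1)-th individual opens a new pattern with probability α/(α + i), so that the expected number of patterns after n individuals is the sum of these probabilities and grows like α log n. The function names and the concentration value α = 1 are illustrative choices.

```python
import math
import random


def crp_num_tables(n, alpha, rng):
    """Simulate a Chinese restaurant process and return the number of tables.

    Only the count is tracked: customer i+1 (with i customers already seated)
    opens a new table with probability alpha / (alpha + i).
    """
    tables = 0
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            tables += 1
    return tables


def expected_num_tables(n, alpha):
    """Exact expected table count: sum over i of alpha / (alpha + i)."""
    return sum(alpha / (alpha + i) for i in range(n))


if __name__ == "__main__":
    rng = random.Random(0)
    alpha = 1.0  # illustrative concentration parameter
    for n in (100, 1000, 10000):
        simulated = sum(crp_num_tables(n, alpha, rng) for _ in range(200)) / 200
        print(n, round(simulated, 2), round(expected_num_tables(n, alpha), 2),
              round(alpha * math.log(n), 2))
```

Multiplying the number of individuals by ten adds about α log 10 patterns in expectation, which is the behavior that clashes with the protein-complex example.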

Other Criteria for Model Choice. The statistical and data mining literatures contain many other
criteria and approaches to deal with the issue of model choice, e.g., reversible jump MCMC
techniques, Bayes factors and other marginal likelihood methods, and penalized likelihood
criteria such as the Bayesian information criterion (BIC) (Schwartz, 1978; Pelleg and Moore, 2000),
the Akaike information criterion (AIC) (Akaike, 1973), the deviance information criterion (DIC)
(Spiegelhalter et al., 2002), and minimum description length (MDL) (Chakrabarti et al., 2004). See
Han and Kamber (2000) for a review of solutions in the data mining community. AIC has a
frequentist motivation and tends to pick models that are too large when the number of parameters is
large; it does not pay a high enough penalty. BIC and DIC have Bayesian motivations and thus
fit more naturally with the specifications in this paper. Neither is truly Bayesian; however, DIC
involves elements that can be computed directly from MCMC calculations, and the variational
approximation to the posterior (described in detail below) allows us to integrate out the nuisance
parameters in order to compute an approximation to BIC for different values of K.
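The difference in penalties can be made concrete with the textbook definitions AIC = -2 log L + 2p and BIC = -2 log L + p log n. The sketch below uses these plain definitions, not the variational approximation to BIC referred to above; the helper names and the log-likelihood values in the usage example are made up for illustration.

```python
import math


def aic(log_lik, num_params):
    """Akaike information criterion: penalty of 2 per parameter."""
    return -2.0 * log_lik + 2.0 * num_params


def bic(log_lik, num_params, num_obs):
    """Bayesian information criterion: penalty of log(num_obs) per parameter."""
    return -2.0 * log_lik + num_params * math.log(num_obs)


def select_k(candidates, num_obs, criterion="bic"):
    """Return the K with the smallest score.

    candidates maps K -> (maximized log-likelihood, number of parameters).
    """
    def score(item):
        log_lik, num_params = item[1]
        if criterion == "aic":
            return aic(log_lik, num_params)
        return bic(log_lik, num_params, num_obs)

    return min(candidates.items(), key=score)[0]


if __name__ == "__main__":
    # Hypothetical fits: larger K improves the likelihood only slightly.
    fits = {5: (-1050.0, 10), 10: (-1000.0, 20), 15: (-985.0, 30)}
    print(select_k(fits, num_obs=1000, criterion="aic"))
    print(select_k(fits, num_obs=1000, criterion="bic"))
```

With these made-up numbers AIC prefers the largest model while BIC prefers the intermediate one, because log n exceeds 2 as soon as n is larger than about 7.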

A Simulation Study. I conclude by presenting some anecdotal evidence I gathered from
synthetic data with the aim of highlighting the dangers of fixing the hyper-parameters according to
some ad-hoc strategy that is not supported by the data, e.g., fixing $\alpha = 50/K$ in the models of
Chapters 3 and 4. I simulated a set of 3,000 documents according to the latent Dirichlet
allocation model for generating textual documents described in Example 22, with $K^* = 15$ topics (the
patterns) and a vocabulary of size 50. I then fitted the correct Bayesian mixed-membership model
on a grid $K = 5, 10, \dots, 45$ that included the true underlying number of groups and associated
patterns, using a five-fold cross-validation scheme. In a first batch of experiments I fitted $\alpha$
using empirical Bayes (Carlin and Louis, 2005), whereas in a second batch of experiments I set
$\alpha = 50/K$, following the analysis in Griffiths and Steyvers (2004). The held-out log-likelihood
profiles are reported in Figure 6.3.
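For readers who want to reproduce a data set of this kind, the generative step can be sketched as follows. This is a minimal sketch under stated assumptions, not the simulation code used for the experiments: the document length, the symmetric topic prior `eta`, and all function names are hypothetical, since the text only fixes the number of documents, the number of topics and the vocabulary size.

```python
import random


def sample_dirichlet(alpha_vec, rng):
    """Draw one Dirichlet vector by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha_vec]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index from a discrete distribution given as probabilities."""
    return rng.choices(range(len(probs)), weights=probs)[0]


def simulate_lda(num_docs, num_topics, vocab_size, doc_len, alpha, eta, rng):
    """Generate documents as lists of word indices under an LDA-style model."""
    # One distribution over the vocabulary per topic (the latent patterns).
    topics = [sample_dirichlet([eta] * vocab_size, rng) for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # Mixed-membership vector of the document over the topics.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)                # topic of this word
            words.append(sample_index(topics[z], rng))  # word given the topic
        docs.append(words)
    return topics, docs


if __name__ == "__main__":
    rng = random.Random(1)
    # 3,000 documents, 15 topics, vocabulary of 50; the rest is assumed.
    topics, docs = simulate_lda(3000, 15, 50, doc_len=100, alpha=0.1, eta=0.5, rng=rng)
    print(len(docs), len(docs[0]))
```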

In this controlled experiment, the optimal number of non-observable topics is $K^* = 15$. This
implies a value of $\alpha = 50/15 = 3.33 > 1$ for the ad-hoc strategy, whereas $\alpha = 0.052 < 1$ according
to the empirical Bayes strategy. Intuitively, the fact that $\alpha > 1$ has a disrupting effect on the model
fit: each topic is expected to be present in each document, or in other words each document is
expected to belong equally to each group/topic, rather than to only a few of them, as is
the case when $\alpha < 1$. As an immediate consequence, the estimates of the components of the
mixed-membership vectors, $\{\theta_{nk}\}$, tend to be diffuse, rather than sharply peaked, as we would expect
in text mining applications. Furthermore, in this simple simulation, setting the hyper-parameter $\alpha$
to a value greater than one when the data supports values in a dramatically different range, e.g.,
$0.01 < \alpha < 0.1$, ultimately biases the estimation of the number of latent patterns. Figure 6.3 shows
that the empirical Bayes strategy correctly recovers $K^* = 15$, whereas the ad-hoc
strategy recovers an erroneous number of latent patterns, $K = 20$.

Figure 6.3: Left: 2D symmetric Dirichlet densities underlying mixed-membership vectors $\theta =
(\theta_1, \theta_2)$, with parameter $\alpha = 4 > 1$ (solid, black line) and with parameter $\alpha = 0.25 < 1$ (dashed,
red line). Right: held-out log-likelihood for the simulation experiments described in the text. The
solid, black line corresponds to the strategy of fixing $\alpha = 50/K$, whereas the dashed, red line
corresponds to the strategy of fitting $\alpha$ via empirical Bayes. $K^*$ is denoted with an asterisk.
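The contrast between diffuse and peaked membership vectors can be illustrated by sampling from a symmetric Dirichlet with the two values of α that arise in the experiment. The sketch below is illustrative only: the summary used here (the average size of the largest component) and the function names are my own choices, not quantities reported in the analysis.

```python
import random


def sample_symmetric_dirichlet(k, alpha, rng):
    """Draw a k-dimensional symmetric Dirichlet(alpha) vector via Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_max_component(k, alpha, num_draws, rng):
    """Average, over num_draws samples, of the largest membership weight."""
    return sum(max(sample_symmetric_dirichlet(k, alpha, rng))
               for _ in range(num_draws)) / num_draws


if __name__ == "__main__":
    rng = random.Random(0)
    k = 15
    for alpha in (0.052, 50.0 / 15.0):
        print(round(alpha, 3), round(mean_max_component(k, alpha, 2000, rng), 3))
```

A value close to 1 means that a typical vector puts most of its mass on a single pattern; a value close to 1/K means that the mass is spread evenly, which is the behavior induced by α > 1.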

Concluding, experiments in a controlled setting suggest that it is desirable not to fix the
hyper-parameters, e.g., the non-observable category abundances $\alpha$, according to ad-hoc strategies, unless
such strategies are supported by previous analyses. Ad-hoc strategies will affect inference about
the number of non-observable patterns in non-controllable ways, and ultimately bias the analysis
of data and the substantive conclusions. This effect can be observed in a real problem setting in
Example 25 of Chapter 4 by looking at the entries in Figure 4.10. The plots in the right column
display latent topics that were estimated using the strategy of fixing $\alpha$; they are visibly more
diffuse than the topics estimated by fitting the hyper-parameter $\alpha$ using the empirical Bayes strategy
(plots in the left column).

Figure 6.4: The non-parametric empirical Bayes approach at a glance.

6.2.3  Nonparametric  Empirical  Bayes 

An  issue  with  mixture  models  is  that  of  multiple  local  peaks  (e.g.,  Buot  and  Richards,  2006a, b). 
Depending  on  the  signal-to-noise  ratio  in  the  data,  this  can  lead  to  problematic  inferences.  How¬ 
ever,  even  in  those  cases  where  the  signal  is  buried  in  the  noise  it  is  possible  to  adopt  estimation 
and  inference  strategies  that  minimize  the  issue. 

Example 30. In the application of the admixture of latent blocks model to protein interaction
networks (Airoldi et al., 2006c) a two-stage approach is used; a model with no interactions among
proteins in different complexes is fit first, i.e., B is constrained to be the identity, and then the full
model is fit, i.e., B is unconstrained. In the second stage, the mixed membership map is initialized
to that recovered with the simpler model. In this model, the strategy aims at resolving the
inference between the two competing explanations for the interactions; namely, the mixed membership
map between proteins and stable protein complexes, and the block model that encodes interactions
among proteins in different complexes.

In order to perform inference in the models presented, a multiple-stage approach to estimation
and inference is adopted; see Figure 6.4. In general, the non-parametric empirical Bayes approach
is an engineering solution. The approach suggests fitting a sequence of models, from most simple
to most complex, which are not necessarily nested. The results of estimation and inference in
simpler models are used to inform (or calibrate) priors for the parameters in the more complex
models. In future work I plan to quantify both (i) the signal-to-noise ratio at which problems
arise, and its interaction with (ii) the effect of the distance between the true model and the
starting model on the probability of successful estimation and inference.


6.2.4  Scalability 

The  scalability  of  posterior  inference  algorithms  in  models  of  relational  data,  e.g.,  the  stochastic 
block  models  of  mixed  membership  of  Section  4.2.2,  is  a  crucial  practical  issue  given  the  size  of 
social  and  biological  networks  of  interest  that  arise  in  modern  applications.  Below,  I  illustrate  a 
possible  solution  to  perform  fast  posterior  inference  in  the  context  of  a  specific  model.  Notably, 
the  proposed  nested  variational  inference  strategy  is  applicable  to  other  models  of  relational  data 
and  makes  posterior  inference  feasible  in  applications  that  involve  large  graphs  and  networks. 

Consider  the  admixture  of  latent  blocks  model  of  Section  3.1.  To  achieve  fast  convergence 
of  the  proposed  posterior  inference  algorithm  in  that  case,  I  employed  a  highly  effective  nested 
variational  inference  scheme  based  on  a  non-trivial  scheduling  of  variational  parameters  updating. 
The  resulting  algorithm  is  also  parallelizable  on  a  computer  cluster. 

In a naive iteration scheme for variational inference, one would initialize the variational
Dirichlet parameters $\gamma_{1:N}$ and the variational multinomial parameters $(\phi_{p \to q}, \phi_{p \leftarrow q})$ to non-informative
values, and then iterate until convergence the following two steps: (i) update $\phi_{p \to q}$ and $\phi_{p \leftarrow q}$ for
all edges $(p, q)$, and (ii) update $\gamma_p$ for all nodes $p \in \mathcal{N}$. In such an algorithm, at each variational
inference cycle we need to allocate $NK + 2N^2K$ scalars. Experimental evidence (Airoldi et al.,
2006d) suggests that the naive variational algorithm often fails to converge, or converges after a
large number of iterations. I attribute this behavior to a dependence that the two main assumptions
(block model and mixed membership) induce between $\gamma_{1:N}$ and $B$, which is not satisfied by the
naive algorithm. Some intuition about why this may happen follows. From a purely algorithmic
perspective, the naive variational EM algorithm instantiates a large coordinate ascent algorithm,
where the parameters can be semantically divided into coherent blocks. Blocks are processed in
a specific order, and the parameters within each block all get updated each time.1 At every new
iteration the naive algorithm sets all the elements of $\gamma_p^{t+1}$ equal to the same constant. This dampens
the likelihood by suddenly breaking the dependence between the estimates of parameters in $\gamma_{1:N}$
and in $B$ that was being inferred from the data during the previous iteration.

Outer loop

1. initialize $\gamma^{0}_{pk}$ to a common constant for all $p, k$
2. repeat
3.     for $p = 1$ to $N$
4.         for $q = 1$ to $N$
5.             get variational $\phi^{t+1}_{p \to q}$ and $\phi^{t+1}_{p \leftarrow q}$ $= f(R(p,q), \gamma^{t}_{p}, \gamma^{t}_{q}, B^{t})$
6.             partially update $\gamma^{t+1}_{p}$, $\gamma^{t+1}_{q}$ and $B^{t+1}$
7. until convergence

Figure 6.5: The nested (two-layered) variational inference algorithm for $\gamma$ and $(\phi_{p \to q}, \phi_{p \leftarrow q})$. The
inner layer consists of Step 5. The function $f$ is described in detail in Figure 6.6.

Instead, the nested variational inference algorithm maintains some of this dependence that is
being inferred from the data across the various iterations. This is achieved mainly through a
different scheduling of the parameter updates in the various blocks. To a minor extent, the dependence
is maintained by always keeping the block of free parameters, $(\phi_{p \to q}, \phi_{p \leftarrow q})$, optimized given the
other variational parameters. Note that these parameters are involved in the updates of parameters
in $\gamma_{1:N}$ and in $B$, thus providing us with a channel to maintain some of the dependence among
them, i.e., by keeping them at their optimal value given the data. Further, the nested algorithm
has the advantage that it trades time for space, thus allowing us to deal with large graphs; at each
variational cycle we need to allocate $NK + 2K$ scalars only. The increased running time is
partially offset by the fact that the algorithm can be parallelized and leads to empirically observed
faster convergence rates. This algorithm is also better than MCMC variations (i.e., blocked and
collapsed Gibbs samplers) in terms of memory requirements and convergence rates.

1 Within a block, the order according to which (scalar) parameters get updated is not expected to affect convergence.

Inner loop

1. initialize $\phi^{0}_{p \to q, g} = \phi^{0}_{p \leftarrow q, h} = 1/K$ for all $g, h$
2. repeat
3.     for $g = 1$ to $K$
4.         update $\phi^{s+1}_{p \to q, g} \propto f_1(\phi^{s}_{p \leftarrow q}, \gamma_p, B)$
5.     normalize $\phi^{s+1}_{p \to q}$ to sum to 1
6.     for $h = 1$ to $K$
7.         update $\phi^{s+1}_{p \leftarrow q, h} \propto f_2(\phi^{s}_{p \to q}, \gamma_q, B)$
8.     normalize $\phi^{s+1}_{p \leftarrow q}$ to sum to 1
9. until convergence

Figure 6.6: Details of Step 5 in Figure 6.5: the inference algorithm for the variational parameters
$(\phi_{p \to q}, \phi_{p \leftarrow q})$ corresponding to the basic observation $R(p, q)$. The functions $f_1$ and $f_2$ are updates for
$\phi_{p \to q, g}$ and $\phi_{p \leftarrow q, h}$ described in the text of Section 4.1.3.
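The inner loop of Figure 6.6 can be sketched for a single pair (p, q) as below. This is a minimal sketch, not the implementation used in the thesis: the exact update functions are those of Section 4.1.3, which is not reproduced here, so the code assumes a common mean-field form for a Bernoulli block model in which each update combines an expected log membership weight with the expected log-likelihood of the observed edge. The inputs `elog_pi_p` and `elog_pi_q` (expected log memberships, which would be derived from the variational Dirichlet parameters) and all function names are hypothetical.

```python
import math


def _normalize_exp(scores):
    """Exponentiate log-scores stably and normalize them to sum to 1."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def inner_loop(y, elog_pi_p, elog_pi_q, B, tol=1e-8, max_iter=200):
    """Fixed-point iteration for the two indicator distributions of one pair.

    y is the observed 0/1 relation, B a K x K matrix with entries in (0, 1).
    Returns (phi_out, phi_in), each a list of K probabilities.
    """
    K = len(B)
    # Log-likelihood of the observed edge value under each block pair (g, h).
    loglik = [[math.log(B[g][h]) if y == 1 else math.log(1.0 - B[g][h])
               for h in range(K)] for g in range(K)]
    phi_out = [1.0 / K] * K  # uniform initialization
    phi_in = [1.0 / K] * K
    for _ in range(max_iter):
        # Both updates read the values of the previous sweep (superscript s).
        new_out = _normalize_exp([
            elog_pi_p[g] + sum(phi_in[h] * loglik[g][h] for h in range(K))
            for g in range(K)])
        new_in = _normalize_exp([
            elog_pi_q[h] + sum(phi_out[g] * loglik[g][h] for g in range(K))
            for h in range(K)])
        change = max(abs(a - b)
                     for a, b in zip(new_out + new_in, phi_out + phi_in))
        phi_out, phi_in = new_out, new_in
        if change < tol:
            break
    return phi_out, phi_in


if __name__ == "__main__":
    B = [[0.9, 0.1], [0.1, 0.9]]
    print(inner_loop(1, [-0.2, -2.0], [-0.2, -2.0], B))
```

Only the two length-K vectors for the current pair are held in memory at any time, which is where the saving in storage relative to the naive schedule comes from.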


Complexity. Recall that attribute measurements taken on individual objects in a population of
interest can be represented as a bipartite graph, and that relational measurements taken on pairs of
objects in a population of interest can be represented as a unipartite graph. In both cases, denote
the number of edges in the graph by I, the number of objects by N, the number of attributes by M,
the number of latent patterns by K, and the number of iterations until convergence of the posterior
inference algorithm employed by T.
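With N and K as just defined, the storage counts quoted earlier in this section for the two update schedules can be compared directly. The sketch below only evaluates those two expressions, $NK + 2N^2K$ for the naive scheme and $NK + 2K$ for the nested scheme; the function names and the example sizes are illustrative.

```python
def naive_scalars(num_nodes, num_patterns):
    """Scalars allocated per cycle by the naive scheme: NK + 2 N^2 K."""
    return num_nodes * num_patterns + 2 * num_nodes ** 2 * num_patterns


def nested_scalars(num_nodes, num_patterns):
    """Scalars allocated per cycle by the nested scheme: NK + 2K."""
    return num_nodes * num_patterns + 2 * num_patterns


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, naive_scalars(n, 10), nested_scalars(n, 10))
```

The naive count grows quadratically in the number of nodes, while the nested count grows linearly.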

In summary, the complexity of fitting a model of multivariate attributes that follows the general
specifications of Section 4.2.1 is

$$O(I + NMKT + K^2 T),$$

whereas the complexity of fitting a model of multivariate relations that follows the general
specifications of Section 4.2.2 is

$$O(I + N^2 KT + K^2 T).$$


Appendix A


Proof  of  Lemma  2 


The  proof  is  based  on  the  following  result. 

Fact 1. There exists a permutation $p$ of the columns of $A_{(\ell \times k)}$ such that $[A]_{(i, p(j))} = [A_1 \mid A_2]$,
where $A_1$ is $(\ell \times \ell)$ and has full rank, and $A_2$ is $(\ell \times (k - \ell))$.


As a consequence we can permute the components of $X$ to get $[X]_{p(j)} = [X_1 \mid X_2]'$, and $Y =
AX = A_1 X_1 + A_2 X_2$, and finally express $X_1$ in terms of $X_2$ and $Y$, like so:

$$X_1 = A_1^{-1}(Y - A_2 X_2).$$

Proof. The Gibbs sampler scheme involves iterative sampling from the full conditional
distributions $P(Z_i \mid Z_{(-i)} = z_{(-i)})$, for $i = 1, \dots, N$, where $Z$ is a vector. A sufficient condition to ensure the
irreducibility of the chain (Besag, 1974) requires that the support of the full conditional
distributions is positive where that of the joint distribution of $Z$ is positive, that is:

$$P(Z_i = z_i, Z_{(-i)} = z_{(-i)}) > 0 \;\Rightarrow\; P(Z_i \mid Z_{(-i)} = z_{(-i)}) > 0. \quad \text{(A.1)}$$

2D case: we show that condition A.1 holds. Specifically consider the situation displayed in figure
6 above, where there are $k - \ell = 2$ components of $X_2$ that we need to sample from. The chain is
at a point $X_2 \geq 0$ where the joint support is positive and $A_1^{-1}(Y - A_2 X_2) \geq 0$, and it moves by
$(+\epsilon, +\epsilon)'$ to the point $X_2 + (\epsilon, \epsilon)'$ where the joint support is also positive and
$A_1^{-1}(Y - A_2(X_2 + (\epsilon, \epsilon)')) \geq 0$. We want to show that whenever both $X_2$ and $X_2 + (\epsilon, \epsilon)'$ are feasible, it is possible
to pass from the former to the latter by means of component-wise moves, as we would with Gibbs
moves; that is, the support of the full conditionals must be positive either at
$A_1^{-1}(Y - A_2(X_2 + (0, \epsilon)'))$ or at $A_1^{-1}(Y - A_2(X_2 + (\epsilon, 0)'))$. In other words we want to show that

$$\{\, A_1^{-1}(Y - A_2 X_2) \geq 0 \;\wedge\; A_1^{-1}(Y - A_2(X_2 + (\epsilon, \epsilon)')) \geq 0 \,\} \quad \text{(A.2)}$$

implies

$$\{\, A_1^{-1}(Y - A_2(X_2 + (\epsilon, 0)')) \geq 0 \;\vee\; A_1^{-1}(Y - A_2(X_2 + (0, \epsilon)')) \geq 0 \,\}. \quad \text{(A.3)}$$

Assume that A.2 holds. Notice that $A_1^{-1}(Y - A_2(X_2 + (\epsilon, \epsilon)')) = A_1^{-1}(Y - A_2 X_2 - \epsilon(A_2^{11}, A_2^{21})' -
\epsilon(A_2^{12}, A_2^{22})') \geq 0$. Add $A_1^{-1}(Y - A_2 X_2) \geq 0$, non-negative by assumption, and rearrange terms to
get $A_1^{-1}(Y - A_2 X_2 - \epsilon(A_2^{11}, A_2^{21})') + A_1^{-1}(Y - A_2 X_2 - \epsilon(A_2^{12}, A_2^{22})') \geq 0$, which cannot be the
sum of two negative quantities. QED.

Similar derivations show that whenever the joint support has positive probability at
$A_1^{-1}(Y - A_2(X_2 - (\epsilon, \epsilon)'))$ then it is also possible for the chain to get there either through
$A_1^{-1}(Y - A_2(X_2 - (0, \epsilon)'))$ or through $A_1^{-1}(Y - A_2(X_2 - (\epsilon, 0)'))$; and that the same condition holds as we consider
the moves to the points $A_1^{-1}(Y - A_2(X_2 + (\epsilon, -\epsilon)'))$ and $A_1^{-1}(Y - A_2(X_2 + (-\epsilon, \epsilon)'))$.
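The rearrangement used in the 2D argument can be written out in one line. As shorthand that is not in the original, let $v = A_1^{-1}(Y - A_2 X_2)$ and $u_j = \epsilon A_1^{-1} a_j$, where $a_j$ denotes the $j$-th column of $A_2$, so that $A_2 (\epsilon, 0)' = \epsilon a_1$ and $A_2 (0, \epsilon)' = \epsilon a_2$.

```latex
\begin{align*}
A_1^{-1}\bigl(Y - A_2 (X_2 + (\epsilon, \epsilon)')\bigr) + A_1^{-1}(Y - A_2 X_2)
  &= (v - u_1 - u_2) + v \\
  &= (v - u_1) + (v - u_2) \\
  &= A_1^{-1}\bigl(Y - A_2 (X_2 + (\epsilon, 0)')\bigr)
   + A_1^{-1}\bigl(Y - A_2 (X_2 + (0, \epsilon)')\bigr).
\end{align*}
```

Both terms on the left-hand side are non-negative under (A.2), so the sum on the right-hand side is non-negative as well, which is the step the argument relies on.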


General case: the proof is exactly the same as in the 2D case, but more tedious. Now $X_2$
and $(\epsilon, \dots, \epsilon)'$ are $k - \ell = n$-dimensional. Assume that $A_1^{-1}(Y - A_2 X_2) \geq 0$ and
$A_1^{-1}(Y - A_2(X_2 + (\epsilon, \dots, \epsilon)')) \geq 0$ hold true. Rewrite $A_1^{-1}(Y - A_2(X_2 + (\epsilon, \dots, \epsilon)'))$ as
$A_1^{-1}(Y - A_2 X_2 - \epsilon(A_2^{11}, A_2^{21}, \dots, A_2^{\ell 1})' - \dots - \epsilon(A_2^{1n}, A_2^{2n}, \dots, A_2^{\ell n})') \geq 0$.
Add $(n - 1) \times A_1^{-1}(Y - A_2 X_2) \geq 0$, non-negative by assumption, and rearrange terms to get
$A_1^{-1}(Y - A_2 X_2 - \epsilon(A_2^{11}, A_2^{21}, \dots, A_2^{\ell 1})') + \dots +
A_1^{-1}(Y - A_2 X_2 - \epsilon(A_2^{1n}, \dots, A_2^{\ell n})') \geq 0$, which cannot be the sum of $n$ negative terms. QED.

Again similar derivations show that condition A.1 holds as we consider moves to other points
$X_2 + (\pm\epsilon, \dots, \pm\epsilon)'$. □


Full Conditionals for the Gibbs Sampler


Say  6  =  (Ai, then  P(X,&)  =  [  Cl,  I’ih)  =  ]  I'iX,  A,,oi  P(X,)  l>(o). 

We want $\lambda_i \in (0, \infty)$ and $\phi \in (0, \infty)$. As an example, assume priors for $\lambda_i$ and $1/\phi$ proportional
to a constant, and $\tau = 1$. Then, noticing that $P(\theta \mid X, Y) = P(\theta \mid X)\, I_{\{A^{-1}Y\}}(X)$, the following
full conditional distributions can be derived.


P(\\X,Y) 


oc 


aP 


1  / 

 — ^ — 


P(0|A,F) 


CX  P(Xi\Xi,  0)  •  P(0) 


(X 


-Iv 

2^> 


/0+2 


lQg(x?;)-A7 
In order to compute $P(X \mid Y, \theta)$ we use the fact in Appendix A to conclude that $P(X \mid Y, \theta) =
P(X_2 \mid Y, \theta) \times P(X_1(X_2) \mid Y, \theta)$; hence for $X_i \in X_2$ and $X_j \in X_1$ it follows:

$$P(X_i \mid X_{(-i)}, Y, \theta) \propto P(X_i \mid \theta) \cdot P(X_1 \mid Y, \theta)
= \text{log-Normal}_{X_i}(\lambda_i, \phi\lambda_i) \cdot \prod_j \text{log-Normal}_{X_j}(\lambda_j, \phi\lambda_j)\, I_{\{A^{-1}Y\}}(X_j) \quad \text{(B.1)}$$

In the analysis, we explored the various posterior distributions using the Gibbs sampler with
Metropolis steps. In order to sample from $P(X_i \mid Y, \theta)$ and $P(\lambda_i \mid X, Y)$, we used $\chi^2$ and
Uniform proposals, improper priors on the lambdas (all proportional to a constant), and several flavors
for the improper prior on $\phi$ (proportional to a constant, to $1/\phi$, and to $1/\phi^2$).




Appendix  C 


Compendium  of  Network  Models 


Models for graphs of various types are scattered across research in the social, physical,
mathematical, statistical, and computing sciences. In this review of the literature, I emphasize those statistical
models that attempt to express the dependencies between objects in the system, in some sense.


C.1 Static Graphs


The works in this section take measurements about objects as input, and start from a network as
given.

Exponential Random Graph Models. Under the assumption that two possible social ties are
dependent only if a common actor is involved in both,1 Frank and Strauss (1986) devised the
following characterization for the probability distribution of undirected Markov graphs:

$$P_\theta(Y = y) = \frac{1}{\kappa(\theta, \tau)} \exp\Bigl\{ \sum_{k=1}^{n-1} \theta_k S_k(y) + \tau\, T(y) \Bigr\}, \quad \text{(B.1)}$$

where the statistics $S_k$ and $T$ count specific structures, such as edges, triangles, and $k$-stars, $\{\theta_k\} =
\theta$ and $\tau$ are the parameters, and $\kappa(\theta, \tau)$ is the normalizing constant. Frank and Strauss (1986)
worked mainly with three-parameter models, where $\theta_3 = \dots = \theta_{n-1} = 0$. They proposed the
pseudo-likelihood estimation method to estimate the complete vector of parameters by maximizing
the following pseudo-likelihood function:

$$\ell(\theta) = \sum_{i<j} \log P_\theta\bigl\{ Y_{ij} = y_{ij} \mid Y_{uv} = y_{uv} \text{ for all } u < v,\ (u, v) \neq (i, j) \bigr\}. \quad \text{(B.2)}$$

1 This is the intuitive definition of Markov property for spatial processes on a lattice in Besag (1974).


Wasserman and Pattison (1996) proposed the current formulation of Exponential Random Graph
Models (ERGM), also referred to as $p^*$ models, as a generalization of the Markov graphs of Frank
and Strauss. For both directed and undirected graphs, they maintain a similar characterization of
the probabilities where the statistics $S_k$ and $T$ are replaced by arbitrary statistics $U$. This leads
to probability functions of the form

$$P_\theta(Y = y) = \frac{1}{\kappa(\theta)} \exp\Bigl\{ \sum_{k} \theta_k U_k(y) \Bigr\}. \quad \text{(B.3)}$$

More recently, Snijders et al. (2004) have proposed a variant of these models where the major
problem of double-counting2 is mitigated, but not overcome. Hunter and Handcock (2004) propose an
alternative estimation scheme that corrects parameter estimates for double-counting. This
estimation procedure can be used for models based on distributions in the curved exponential family.
Park and Newman (2004) formally characterize sensitivity issues.
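The statistics that enter the three-parameter Markov graph model (edges, 2-stars and triangles) are simple subgraph counts. The sketch below computes them for a small undirected graph and evaluates the unnormalized log-probability as a weighted sum of the counts. It assumes nodes labeled 0 to n-1 and an edge list as input; the function names and the parameter values in the usage example are illustrative, and the normalizing constant is deliberately left out because it requires a sum over all graphs.

```python
from itertools import combinations
from math import comb


def markov_graph_statistics(n, edges):
    """Return (number of edges, number of 2-stars, number of triangles)."""
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    num_edges = sum(len(nb) for nb in adj.values()) // 2
    # A 2-star is a pair of edges sharing a node: choose 2 neighbors per node.
    two_stars = sum(comb(len(nb), 2) for nb in adj.values())
    triangles = sum(1 for a, b, c in combinations(range(n), 3)
                    if b in adj[a] and c in adj[a] and c in adj[b])
    return num_edges, two_stars, triangles


def unnormalized_log_prob(stats, theta1, theta2, tau):
    """theta1 * edges + theta2 * 2-stars + tau * triangles."""
    num_edges, two_stars, triangles = stats
    return theta1 * num_edges + theta2 * two_stars + tau * triangles


if __name__ == "__main__":
    stats = markov_graph_statistics(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
    print(stats, unnormalized_log_prob(stats, -1.0, 0.1, 0.5))
```

Each triangle also contributes three edges and three 2-stars, which is one way to see that the statistics count overlapping sets of edges.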

2 The statistics $S_k(y)$ count graph structures. Although they are not independent, i.e., they count overlapping sets
of edges, they are assumed independent in the pseudo-likelihood. Ignoring the correlations is a bad idea, and causes
extreme sensitivity of the predicted number of edges to small changes in the value of certain parameters.

Remark A. It is possible to express the current formulation of exponential random graphs using
the formalism of undirected graphical models. Let us write the likelihood of an arbitrary undirected
graph,

$$P(x \mid \theta) = \frac{1}{Z} \prod_{c \in C} \psi(x_c \mid \theta_c), \quad \text{(B.4)}$$

where $x_c$ denotes the nodes in clique $c$, $\theta_c$ denotes the corresponding set of parameters, $\psi$ are
non-normalized potentials over the cliques, and $Z = \sum_{x} \prod_{c \in C} \psi(x_c \mid \theta_c)$ is the normalization
constant. If the likelihood is in the exponential family, we can write:


Remark  B.  Models  in  this  family  are  not  “generative  models”  in  that  no  assumptions  are  present 
to  explain  how  the  sufficient  statistics  are  generated.  However,  it  is  possible  to  posit  a  generative 
model  that  includes  exponential  random  graph  models,  or  any  other  conditional  model,  as  part  of 
the  emission  model  (Airoldi  et  al.,  2006b). 

Latent  Variable  Models.  The  notions  of  equivalence,  structural  equivalence,  and  blocks  are 
introduced  by  Lorrain  and  White  (1971)  and  further  explored  by  many,  notably  by  Faust  (1988).  A 
comprehensive  treatment  of  models  that  use  blocks  to  express  the  complexity  of  the  data  is  given 
in  Doreian  et  al.  (2004).  A  summary  of  models  and  notions  relevant  for  social  networks  developed 
in  the  social  sciences  can  be  found  in  Wasserman  and  Faust  (1994). 

Stochastic block models, the probabilistic treatment of blocks, appeared early in the
statistical sciences (Holland and Leinhardt, 1975), were widely studied (Fienberg and Wasserman,
1981; Fienberg et al., 1981; Holland et al., 1983; Fienberg et al., 1985; Wang and Wong, 1987;
Wasserman and Anderson, 1987; Anderson et al., 1992), and were recently rediscovered (Snijders and Nowicki,
1997; Nowicki and Snijders, 2001), including a non-parametric treatment of the number of blocks
(Kemp et al., 2004), and integration with non-relational information to infer the blocks (Wang et al.,
2005). A general stochastic block model of mixed-membership has been recently proposed (Airoldi et al.,
2005b, 2006c), along with a framework to integrate external information of different types (Airoldi et al.,
2006b), that relaxes the historical assumption of single-membership of objects to blocks, and
estimates block-to-block connectivity patterns in a Bayesian fashion.

Remark C. A general framework for integration of a different nature is described by Carley
(2002).

An alternative approach is that of latent space models, where observed interactions are projected on a
latent space through a generalized linear model (Hoff et al., 2002; Hoff, 2003a,b). Hoff et al. (2002)
use MCMC to infer latent space positions, treated as hyper-parameters. Hoff (2003b) specifies a
Gaussian prior over the latent space, thus giving the model a fully generative flavor, with the goal
of modeling reciprocity.

Remark  D.  It  is  possible  to  posit  a  generative  model  that  includes  generalized  linear  models 
as  part  of  the  emission  model  (Airoldi  et  al.,  2006b).  The  connection  between  stochastic  block 
models  (SBM)  and  latent  space  models  (LSM)  is  more  subtle,  though. 

Both SBM and LSM seek to define a conditional probability distribution for relations $\{y_{nm}\}$
among actors in a way that reflects some latent semantics (i.e., roles, topics, functions, etc.) of the
actors. Letting $Z_n$ denote a latent variable capturing the latent semantic representation of actor $n$,
the SBM usually defines a generative model $y_{nm} \sim f(\cdot \mid Z_n, Z_m)$, in which the $Z$'s typically act as
indicators of context-dependent edge generating processes. On the other hand, an LSM maps the
observed relation $y_{nm}$ to some latent semantic differences between the two actors via a regression
function, in which the $Z$'s typically represent the projections of the actors onto some latent metric
space where their differences can be measured via a Euclidean metric. Specifically, in LSM the $Z$'s
are multivariate and continuous, e.g., they could be drawn from a multivariate normal, and their realizations indicate
the position of actors in the latent space. In SBM the $Z$'s are multivariate and discrete, e.g., they could be
drawn from a multinomial($\theta$), and their realizations indicate which group an actor belongs to,
for each observed interaction. In other words, the dimensionality of the $Z$'s in SBM reflects how
many latent groups are to be captured in a domain, whereas in LSM the dimensionality of the $Z$'s does
not have an explicit interpretation in terms of groups. In fact, in LSM we need to run a clustering
procedure (e.g., k-means) in the latent space where the actors are projected to, in order to decide
how many groups there are. Thus, the two types of network models are different: SBM focuses
on latent membership of each actor and underlines the importance of modeling the “grouping” of
actors, whereas LSM focuses on latent distances and therefore places more emphasis on modeling proper
projections of actors into a latent manifold. Hoff’s formulation of LSM is not a soft version of
SBM. As a result, SBM and LSM have some orthogonal advantages in modeling network data.
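The contrast between the two families can be made concrete with two small generators for an undirected binary network. This is a simplified sketch rather than any of the cited models in full: the block model below gives each actor a single group (no mixed membership), and the latent space model uses the distance form, in which the log-odds of a tie equal an intercept minus the Euclidean distance between latent positions. The function names, the standard normal prior on positions, and the parameter values are illustrative assumptions.

```python
import math
import random


def sample_sbm(n, theta, B, rng):
    """Discrete latent variables: z_n ~ Categorical(theta), tie ~ Bernoulli(B[z_n][z_m])."""
    groups = range(len(theta))
    z = [rng.choices(groups, weights=theta)[0] for _ in range(n)]
    y = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            y[i][j] = y[j][i] = int(rng.random() < B[z[i]][z[j]])
    return z, y


def sample_lsm(n, dim, intercept, rng):
    """Continuous latent positions: logit P(tie) = intercept - ||z_n - z_m||."""
    z = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    y = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            eta = intercept - math.dist(z[i], z[j])
            prob = 1.0 / (1.0 + math.exp(-eta))
            y[i][j] = y[j][i] = int(rng.random() < prob)
    return z, y


if __name__ == "__main__":
    rng = random.Random(0)
    groups, y_sbm = sample_sbm(10, [0.5, 0.5], [[0.9, 0.05], [0.05, 0.9]], rng)
    positions, y_lsm = sample_lsm(10, 2, 1.0, rng)
    print(groups)
    print(positions[0])
```

In the first generator the latent variable is a group label that can be read off directly; in the second it is a point in space, and groups would have to be recovered afterwards, for instance by clustering the positions.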

Remark E. Connections have been highlighted to MDS and other linear methods (Breiger et al.,
1975), to unsupervised learning, e.g., PCA, FA (Ghahramani, 2004), and to matrix factorization
(Ding, 2005; Xing and Jordan, 2003).

Spectral Methods. Research on Gaussian unitary ensembles provides a probabilistic
connection to spectral decompositions (Metha, 2004). In the computer science literature, there is a
stream of works in this area well summarized by Saul (2005), who discusses comparisons to PCA,
MDS and other linear methods. Briefly, these works include Isomap (Tenenbaum et al., 2000), local linear
embedding (Roweis and Saul, 2000), Laplacian eigenmaps (Belkin and Niyogi, 2002), Hessian eigenmaps
(Donoho and Grimes, 2003), maximum variance unfolding (Weinberger and Saul, 2004; Weinberger et al.,
2004; Sun et al., 2006), conformal eigenmaps (Coifman et al., 2005a,b; Lafon and Lee, 2006) and
its asymptotics (Nadler et al., 2005), and the recent reformulation of problems and solutions in
terms of tensors (He et al., 2005).

Simple Models of Real-World Phenomena and their Mathematical Properties. Much of the
research across communities concerns the study of real-world graphs and their properties with the
aim of building toy models that capture such properties. For example, Newman and Park (2003)
study transitivity and assortative mixing (i.e., positive correlation of degrees of adjacent vertices)
via group structure; Hoff (2003a) studies transitivity, reciprocity and balance; Barabasi (2005a)
studies bursts and heavy tails in human dynamics; Zheng et al. (2005) study the size of
individuals’ social networks and means of estimating them from a certain type of survey questions; and
Ganesh et al. (2005) study the effects on epidemics of the topological properties of graphs.

Research originating in mathematics and physics posits simple algorithms for generating graphs
that replicate observed properties, which are amenable to probabilistic analysis. Bollobas and Riordan
(2003) review a few such algorithms for popular graph types (Barabasi and Albert, 1999; Kumar et al.,
2000; Cooper and Frieze, 2003), and present an extended analysis of the “LCD” model of Buckley and Osthus
(2004). Other notable analytical investigations concern sampling and asymptotic results. Park and Newman
(2004, 2005) give analytic solutions for the 2-star network and for clustered networks; Milo et al.
(2004c) analyze sampling algorithms; Kleinberg and Kleinberg (2005) describe asymptotics of
isomorphism and embedding; Stumpf et al. (2005) find that sub-samples of scale-free graphs are
not scale-free, and present a way to study properties of a sub-sample based on moment
generating functions; Flaxman et al. (2005) describe the behavior of high degree vertices and
eigenvalues in scale-free graphs; Chung and Lu (2003) characterize average distances given expected
degrees; and Caldarelli et al. (2004) study the formation of cycles. A series of works is
concerned with models and methods to find “statistically significant” motifs, i.e., recurring edge
patterns over sets of different nodes (Berg and Lassig, 2004; Shen-Orr et al., 2002; Milo et al., 2002;
Artzy-Randrup et al., 2004; Milo et al., 2004a; Kashtan et al., 2004; Milo et al., 2004b). Newman
(2003b) portrays networks as mixtures of patterns; and Vaszquez et al. (2004) present the only
investigation to date of how global patterns may arise from the composition of local ones. A few
comprehensive reviews are available, which summarize many of these findings (Barabasi et al., 1999;
Albert and Barabasi, 2002; Dorogovtsev and Mendes, 2002; Newman, 2003a; Amaral and Ottino,
2004).

A notion that recently captured the attention of funding agencies and high profile journals is
that of “topology types”. Airoldi and Carley (2005) present a review and a critique of this notion.
They survey generative algorithms for random graphs (Erdos and Renyi, 1960), Poisson graphs and
others that lead to heavy tails for the corresponding degree distributions (Simon, 1955; Bollobas,
1985, 2001; Barabasi, 2005a), scale-free graphs (Faloutsos et al., 1999; Barabasi and Albert, 1999;
Huberman and Adamic, 1999; Adamic and Huberman, 2000; Barabasi et al., 2000; Barabasi and Bonabeau,
2003), small-world graphs (Milgram, 1967; Watts and Strogatz, 1998; Kleinberg, 1999a; Amaral et al.,
2000; Liben-Nowell et al., 2005), core-periphery graphs (Borgatti and Everett, 1999), and cellular
graphs and networks (Frantz and Carley, 2005a; Airoldi and Carley, 2006). Several of these topology
types are presented in heuristic terms, vaguely consistent across communities.3 Airoldi and Carley
(2005) show that topology types are not stable under slight differences in the sampling algorithms
that generate topologies adhering to the heuristic requirements of a specific type. That is, slight
differences in the operational definitions of the same topology type lead to graphs that are
separable in terms of the set of common metrics used by practitioners in the various communities.4
A different set of concerns is explored in “robustness” studies, which measure the stability of
topological properties of graphs and networks of specific types to disruption and other stress
situations (Borgatti et al., 2005; Frantz and Carley, 2005b). These works are simulation studies
that approach the sub-sampling issues discussed above from another perspective.

3 A notable survey is that of Mitzenmacher (2004), who presents a brief history of power-laws and lognormal
distributions, and discusses some of their connections from a generative perspective. Newman (2005) discusses the
connections among power-laws, the Pareto distribution, and Zipf’s law. Airoldi and Shalizi (2006) present a clear
analytical overview of these connections.

4 Few works survey network metrics and visualization tools; notable ones are Carley and Reminga (2004) and Frank
(2000).

Remark F. Alternatively, being able to embed the various topology types in a smooth parametric
continuum (e.g., Erdos random, to small-world, to ring lattice; see Watts and Strogatz, 1998)
would help in understanding the boundaries. Unfortunately, this strategy is not practical either.
There is a potpourri of necessary conditions that such a smooth parametric continuum would have to
satisfy, which appear in the heuristic definitions of the various topology types, e.g., the same
degree distribution for all nodes or not, or shortest path as the only notion of distance, or shortest
path and metric embedding. Although it is possible to posit a generative process that satisfies all
the necessary conditions, such a generative process relies on a non-smooth parametric continuum.
Specifically, we would need to introduce in such a process a discrete parameter that controls the
number of different “probabilistic treatments” for the nodes, e.g., the number of degree
distributions. The problem is that, on one hand, the value of such a discrete parameter is difficult to
estimate. On the other hand, its correct estimation is fundamental in correctly assigning the topology
type. Ultimately, the diversity in the notions of topology types translates into the hardness of
the estimation task, upon the success of which depends our ability to discriminate among types.
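
As an illustration of what a one-parameter continuum between topology types looks like, the
following Python sketch rewires a ring lattice in the spirit of Watts and Strogatz (1998). It is a
simplified sketch, not their exact procedure: the names `watts_strogatz` and `clustering`, the
arguments `n`, `k`, `p` and `seed`, and the choice of the clustering coefficient as summary
statistic are all assumptions made here for illustration. With `p = 0` the ring lattice is
returned unchanged; as `p` grows towards 1 the graph becomes increasingly random.

```python
import random


def watts_strogatz(n, k, p, seed=0):
    """Ring lattice on n nodes (k neighbours on each side) with random rewiring.

    Each lattice edge is rewired with probability p by moving its far end to a
    node chosen uniformly among those not yet adjacent. Assumes n > 2 * k.
    Returns an adjacency dictionary: node -> set of neighbours.
    """
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(1, k + 1):
            v = (i + j) % n
            adj[i].add(v)
            adj[v].add(i)

    for i in range(n):
        for j in range(1, k + 1):
            v = (i + j) % n
            if rng.random() < p:
                candidates = [w for w in range(n) if w != i and w not in adj[i]]
                if candidates:
                    w = rng.choice(candidates)
                    # Remove the lattice edge (i, v) and add the new edge (i, w).
                    adj[i].discard(v)
                    adj[v].discard(i)
                    adj[i].add(w)
                    adj[w].add(i)
    return adj


def clustering(adj):
    """Average local clustering coefficient; nodes of degree < 2 contribute 0."""
    total = 0.0
    for nbrs in adj.values():
        d = len(nbrs)
        if d < 2:
            continue
        links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
        total += 2.0 * links / (d * (d - 1))
    return total / len(adj)


# Moving along the continuum: from the ring lattice towards a random graph.
for p in (0.0, 0.1, 1.0):
    print(p, clustering(watts_strogatz(200, 3, p)))
```

Note that this sketch has a single continuous parameter and treats all nodes alike; it does not
include the discrete parameter, e.g., the number of degree distributions, that the remark above
identifies as the source of difficulty.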

To summarize, we can organize the various works according to a few aspects: (a) the notion(s)
of distance between pairs of nodes that are needed; (b) the use, or not, of descriptive statistics,
as well as their nature, i.e., local versus global; (c) the existence of dependence constraints among
neighborhoods; (d) the focus on node patterns (groups) versus edge patterns (motifs), where we do
not distinguish similar edge configurations among different sets of nodes. These aspects have to be
crossed with the nature of the models: (i) “generative” models and algorithms, both probabilistic
and deterministic; (ii) models and algorithms that contain “generative” ideas, both probabilistic
and deterministic; (iii) other models and algorithms.

Problems. Other works have introduced methodological innovations in the context of specific
problems. Notable research in this sense concerns how to find communities in networks (Girvan and Newman,
2002; Newman, 2004a, b), and in bipartite graphs (Mishra et al., 2004). In this vein, Doreian et al.
(2004) summarize relevant works in the social sciences and develop a theory of generalized block
models. A cluster of research is about link mining (Domingos, 2003; Jensen, 1999; Getoor, 2003;
Getoor and Diehl, 2005), graph mining (Chakrabarti, 2005), link prediction (Getoor et al., 2002;
Liben-Nowell and Kleinberg, 2003; Goldenberg and Moore, 2004), and link ranking (Brin and Page,
1998; Kleinberg, 1999b; Cohen et al., 1999; Ng et al., 2001). Other notable works are concerned
with the information flow within a network: the emergence of deadlines (Papadimitriou and Servan-Schreiber,
1999), the dynamics of information (Kleinberg, 2001), the dynamics of information exchange
(Dodds et al., 2003), how to maximize influence spread (Kempe et al., 2003), decentralized
information processing (Van Zandt, 1997), and decentralized search (Kleinberg, 2000, 2004). A
practical set of concerns inspired methods for entity disambiguation (Malin et al., 2005), and
classification of relational data (Macskassy and Provost, 2005). Solan et al. (2005) propose a model to
learn grammar-like rules in natural languages.


Empirical Studies. Another portion of research concerns findings that influenced the development
of theoretical aspects. Notable empirical studies include the web (Faloutsos et al., 1999;
Albert et al., 1999; Kleinberg and Lawrence, 2001), air traffic (Guimera et al., 2005), the creative
enterprise (Barabasi, 2005b), scientific collaborations (Newman, 2001), metabolic networks (Guimera and Amaral,
2005), decentralized search in email networks (Adamic and Adar, 2005), transcriptional regulatory
networks (Balazsi et al., 2005), words (Steyvers and Tenenbaum, 2005), the organization within
the cell (Barabasi and Oltvai, 2004), politics (Porter et al., 2005), complex brain networks (Sporns et al.,
2004), and more (Newman et al., 2002).


C.2  Dynamics  and  Evolution 

Most existing works focus on static networks; however, a few consider methodology
to deal with dynamics and evolution. Notable examples are the stream of works on cellular automata
(Ilachinski, 2001), the early works on diffusion (Coleman et al., 1957), the treatment of dynamics
with Markov chain Monte Carlo (Wasserman, 1980), dynamic random fields on undirected graphs
(Shalizi, 2003), link-copying processes (Kleinberg et al., 1999; Leskovec et al., 2005b, a),
cascading behaviors (Watts, 2002), network tomography and latent allocation (Airoldi and Faloutsos,
2004), dynamics in the social space (Banks et al., 2005), and models that attempt to replicate
real-world phenomena such as opinion formation (Wu and Huberman, 2005), and evolution
(Doreian and Stokman, 1997).


Empirical Studies. Very few studies exist to guide theoretical developments in a dynamic setting.
A few notable ones concern communication networks (Airoldi, 2003; Airoldi and Faloutsos, 2004),
email networks (Kossinets and Watts, 2006), nucleic acid chain dynamics (Sales-Pardo et al., 2005),
and scientific collaborations (Barabasi et al., 2002).


C.3  Building  Graphs  from  Data 

The works in this section share the intuition that measurements about objects are inherently noisy;
the various authors attempt to model the uncertainty associated with the measurements in order
to decide whether two objects are related or not, and thus to create a graph. A popular
approach is to associate a random variable, e.g., Bernoulli, with each “object”, define
the process through which “observations” relate to binary outcomes, and estimate the parameters
of a Bayesian network (Heckerman, 1999) that describes the observations best, through
dependencies among objects. The estimated Bayesian network provides a probabilistic model for
the observed co-occurrences that can be used to predict missing links, or to assess the likelihood
of existing ones (Getoor et al., 2002; Friedman and Koller, 2003; Heckerman et al., 2004;
Goldenberg and Moore, 2004; Teyssier and Koller, 2005). Variations of this approach have found
important applications in building recommender systems (Breese et al., 1998), social
networks (Breiger, 2003), and complex cellular networks (Friedman, 2004; Segal et al., 2005).
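
The general idea of turning noisy co-occurrence observations into a graph can be sketched in a few
lines of Python. The sketch below is a deliberately simplified stand-in, not a Bayesian network
learner and not the method of any of the works cited above: each object is treated as a Bernoulli
indicator across observations, and two objects are linked when their empirical mutual information
exceeds a threshold. The function names, the toy data and the threshold value are assumptions made
for illustration only.

```python
from itertools import combinations
from math import log


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two binary sequences."""
    n = len(x)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for xi, yi in zip(x, y) if xi == a and yi == b) / n
            p_a = sum(1 for xi in x if xi == a) / n
            p_b = sum(1 for yi in y if yi == b) / n
            if p_ab > 0:
                mi += p_ab * log(p_ab / (p_a * p_b))
    return mi


def build_graph(observations, threshold):
    """Link two objects when their indicators are sufficiently dependent.

    `observations` maps an object name to a list of 0/1 indicators, one per
    observation (a hypothetical data layout chosen for this sketch).
    """
    return {
        (u, v)
        for u, v in combinations(sorted(observations), 2)
        if mutual_information(observations[u], observations[v]) > threshold
    }


# Toy data: "a" and "b" always co-occur, while "c" is empirically independent of both.
obs = {
    "a": [1, 1, 0, 0, 1, 0, 1, 0],
    "b": [1, 1, 0, 0, 1, 0, 1, 0],
    "c": [1, 0, 1, 0, 1, 0, 0, 1],
}
print(build_graph(obs, threshold=0.1))  # {('a', 'b')}
```

A pairwise score of this kind ignores the higher-order dependencies among objects that a Bayesian
network is designed to capture, and the threshold is arbitrary; the sketch is only meant to show
how uncertainty about measurements translates into a decision about the presence of an edge.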


C.4  Inadequacies of the Current Research


There are several dimensions that are relevant to statistical analyses of graphs and networks.
Unfortunately, no single approach develops, or at least allows for, all of them.

Dimensions of interest are: (a) a “proper” likelihood function; (b) the fully generative nature of
the model; (c) replicability of interesting properties at both global and local level; (d) the focus on
edges or nodes; (e) notions of distance and embedding of graphs in a metric space; (f) identifiability
issues, which need to be made explicit; (g) hierarchical relations between dyads, triangles, k-stars,
k-triangles and other basic structural (connected) components that are used to summarize and
characterize an observed graph; (h) dependencies among relevant quantities, i.e., the sufficient
statistics, corresponding to a decomposition of the observed graph into cliques or other structures
of interest, which need to be identified; (i) goodness of fit, which must be assessed: current models
tend to over-fit observed graphs and cannot be easily extended as the observed network grows;
(j) the possibility of integrating data on different object types; and other dimensions.

With respect to this last dimension, i.e., integration of multiple object types and data about
them, most existing works tend to concern or assume specific types of data representations, e.g.,
temporal and sequential data in attribute space, or relational data represented by graphs or
networks. We view learning problems along this line as “type-specific-learning” problems. Typically,
one can develop solutions to type-specific-learning problems by devising novel domain- and
data-specific models and algorithms that leverage domain knowledge and semantics of interest for
particular applications. Integrating heterogeneous data types under a unified model remains a
challenge, however, especially for complex graphs that are simultaneously described by intrinsically
different types of characteristics, such as features in attribute space and links in relational space
(Airoldi et al., 2006b).

As we discussed in the previous section, there is a wide range of research questions that an
elegant solution to the issues above may help us answer. It is useful to keep those questions in
mind in order to guide our technical choices.